The following examples focus on loading COG data into an xarray object backed by Dask for efficient in-memory reading and processing.
Info
Xarray is built on NumPy and Pandas, adding capabilities for labeled, multi-dimensional arrays (e.g., climate data, satellite images). It extends NumPy arrays by attaching metadata (coordinates, labels), making it easier to work with dimensions like time, latitude, longitude, and other variables.
Xarray can use Dask arrays for lazy evaluation, enabling work with large datasets that don't fit in memory. Dask optimizes workflows by parallelizing tasks, reading data in chunks, and improving performance and memory efficiency.
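As a minimal sketch of these two ideas (label-based selection and lazy, chunked evaluation), using small synthetic data rather than real datacube imagery:

```python
import numpy as np
import xarray as xr

# Build a small labeled array: 2 time steps over a 3x4 y/x grid
data = np.arange(24, dtype="float32").reshape(2, 3, 4)
da = xr.DataArray(
    data,
    dims=("time", "y", "x"),
    coords={"time": ["2020-01-01", "2021-01-01"]},
)

# Label-based selection instead of positional indexing
first_year = da.sel(time="2020-01-01")
print(first_year.shape)  # (3, 4)

# Chunk the array so operations build a lazy Dask task graph
# (requires Dask to be installed)
lazy = da.chunk({"y": 2, "x": 2})
lazy_mean = lazy.mean()            # nothing is computed yet
print(float(lazy_mean.compute()))  # 11.5
```

Until `.compute()` is called, `lazy_mean` only describes the work to be done; Dask then evaluates the chunks in parallel.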
Load a single COG into an xarray object, using the link to the S3 object.
Rioxarray is based on rasterio and can be used to read data into an Xarray object. The developers of the rioxarray library provide additional usage examples, like this one.
pip install rioxarray
# Dask is needed to be able to use the rioxarray chunks
# https://docs.dask.org/en/stable/install.html
pip install "dask[complete]"
When using rioxarray.open_rasterio(), set chunks to enable lazy loading with Dask. This allows Dask to read data in smaller chunks, improving speed and memory usage through parallel computing. For example, a chunk size of 1000 for both x and y means Dask reads 1000x1000 boxes instead of the entire array, processing multiple chunks simultaneously.
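To make the chunking arithmetic concrete, here is a quick pure-Python sketch (`chunk_count` is a hypothetical helper, not part of any library), using the clipped array size printed in the rioxarray example, 999 x 1134 pixels with 512-pixel chunks:

```python
import math

def chunk_count(height, width, chunk):
    """Number of Dask chunks needed to cover a 2-D array of the given size."""
    return math.ceil(height / chunk) * math.ceil(width / chunk)

# The clipped DEM in the rioxarray example is 999 x 1134 pixels;
# with chunks=512 Dask splits it into 2 x 3 = 6 tasks
print(chunk_count(999, 1134, 512))  # 6

# With a chunk size of 1000, a 5000 x 5000 array becomes 5 x 5 = 25 chunks
print(chunk_count(5000, 5000, 1000))  # 25
```

Each chunk becomes one task in the Dask graph, so the chunk size trades off per-task overhead against memory use.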
import pystac_client
import rioxarray

bbox = [-75.8860, 45.3157, -75.5261, 45.5142]
bbox_crs = "EPSG:4326"

# Link to ccmeo datacube stac-api
stac_root = "https://datacube.services.geo.ca/stac/api"

# Initialize the STAC client
catalog = pystac_client.Client.open(stac_root)

search = catalog.search(
    collections=['mrdem-30'],
    bbox=bbox,
)

# Use rioxarray.open_rasterio() and clip the result to the bbox
for page in search.pages():
    for item in page:
        band = rioxarray.open_rasterio(
            item.assets['dtm'].href,
            chunks=512,
        ).rio.clip_box(*bbox, crs=bbox_crs)

# At this point the data is not read into the Xarray object
# See the Xarray object details
print(band)
# <xarray.DataArray (band: 1, y: 999, x: 1134)> Size: 5MB
# dask.array<getitem, shape=(1, 999, 1134), dtype=float32, chunksize=(1, 512, 512), chunktype=numpy.ndarray>
# Coordinates:
#   * band         (band) int64 8B 1
#   * x            (x) float64 9kB 1.493e+06 1.493e+06 ... 1.527e+06 1.527e+06
#   * y            (y) float64 8kB -1.557e+05 -1.558e+05 ... -1.857e+05 -1.857e+05
#     spatial_ref  int64 8B 0
# Attributes:
#     TIFFTAG_DATETIME:  2024:12:12 12:00:00
#     AREA_OR_POINT:     Area
#     scale_factor:      1.0
#     add_offset:        0.0
#     _FillValue:        -32767.0

# At this point, the metadata and array shape are set, but the data itself isn't read.
# Running .compute() allows Dask to optimize the workflow, evaluating and executing it
# in the most efficient way, optimizing resource usage.

# Perform analysis...

# To read the data
band.compute()
The following libraries are not part of the STAC utils ecosystem. Be aware of the maintenance status of those libraries.
Warning
GDAL's GetGeoTransform and rasterio use different orderings for transform metadata. When using a GDAL-based method, you need to re-order the transform.
See Re-order the STAC proj:Transform for more details.
def reorder_transform(gdal_transform):
    """
    Reorders the GDAL GeoTransform (6-element tuple) into the
    9-element format that is compatible with proj:transform.
    """
    return [
        gdal_transform[1], gdal_transform[2], gdal_transform[0],
        gdal_transform[4], gdal_transform[5], gdal_transform[3],
        0, 0, 1,
    ]
import pystac_client
import stackstac

# Define a bounding box for an AOI in EPSG:4326
bbox = [-71.2155, 45.4012, -71.1079, 45.5083]
bbox_crs = "EPSG:4326"

# Link to ccmeo datacube stac-api
stac_root = "https://datacube.services.geo.ca/stac/api"

catalog = pystac_client.Client.open(stac_root)

search = catalog.search(
    collections=['hrdem-mosaic-1m'],
    bbox=bbox,
)

# Re-order the proj:transform
result_items = []

# Use the pagination to improve efficiency
# Iterate over each returned page and update the transform for each item
for page in search.pages():
    for item in page:
        # reorder_transform() function is defined above
        reordered_transform = reorder_transform(item.properties["proj:transform"])
        item.properties["proj:transform"] = reordered_transform
        result_items.append(item)

# Importing unnecessary assets may cause memory and speed issues.
# To know the assets available in this collection:
print(result_items[0].assets.keys())
# >> dict_keys(['dsm', 'dtm', 'dsm-vrt', 'dtm-vrt', 'thumbnail',
#               'hillshade-dsm', 'hillshade-dtm', 'extent', 'coverage'])

items_xarray = stackstac.stack(
    result_items,
    assets=["dsm", "dtm"],
    bounds_latlon=bbox,
    chunksize=(1000, 1000),
    epsg=3979,
    properties=False,
)

print(items_xarray)
# <xarray.DataArray 'stackstac-a96f69db1a66a462b3fb7c84c412aa7e' (time: 1,
#                                                                 band: 2,
#                                                                 y: 8812,
#                                                                 x: 7581)> Size: 1GB
# dask.array<fetch_raster_window, shape=(1, 2, 8812, 7581), dtype=float64, chunksize=(1, 1, 1000, 1000), chunktype=numpy.ndarray>
# Coordinates:
#   * time         (time) datetime64[ns] 8B 2023-05-09T12:00:00
#     id           (time) <U13 52B '9_2-mosaic-1m'
#   * band         (band) <U3 24B 'dsm' 'dtm'
#   * x            (x) float64 61kB 1.843e+06 1.843e+06 ... 1.855e+06 1.855e+06
#   * y            (y) float64 70kB -3.925e+04 -3.925e+04 ... -5.362e+04
#     title        (band) <U27 216B 'Digital Surface Model (COG)' 'Digital Terr...
#     description  (band) <U61 488B 'Digital Surface Model derived from Airborne...
#     epsg         int64 8B 3979
# Attributes:
#     spec:           RasterSpec(epsg=3979, bounds=(1842906.2779342758, -53625....
#     crs:            epsg:3979
#     transform:      | 1.63, 0.00, 1842906.28|\n| 0.00,-1.63,-39249.98|\n| 0.0...
#     resolution_xy:  (1.6327081635433747, 1.6314054662236352)

# At this point, the metadata and array shape are set, but the data itself isn't read.
# Running .compute() allows Dask to optimize the workflow, evaluating and executing it
# in the most efficient way, optimizing resource usage.
import pystac_client
import odc.stac

# Define a bounding box for an AOI (Ottawa) in EPSG:4326
bbox = [-75.8860, 45.3157, -75.5261, 45.5142]
bbox_crs = "EPSG:4326"

# The landcover collection has a meaningful temporal dimension
# In this example, we filter from 2015 to 2020
date_range = '2015-01-01/2020-12-31'

# Link to ccmeo datacube stac-api
stac_root = "https://datacube.services.geo.ca/stac/api"

catalog = pystac_client.Client.open(stac_root)

search = catalog.search(
    collections=['landcover'],
    bbox=bbox,
    datetime=date_range,
)

# Re-order the proj:transform
result_items = []

# Use the pagination to improve efficiency
# Iterate over each returned page and update the transform for each item
for page in search.pages():
    for item in page:
        # reorder_transform() function is defined above
        reordered_transform = reorder_transform(item.properties["proj:transform"])
        item.properties["proj:transform"] = reordered_transform
        result_items.append(item)

# Importing unnecessary assets may cause memory and speed issues.
# To know the assets available in this collection:
print(result_items[0].assets.keys())
# >> dict_keys(['classification', 'thumbnail'])

items_xarray = odc.stac.load(
    result_items,
    chunks={"x": 512, "y": 512},
    bands=["classification"],  # Asset to load
    bbox=bbox,
)

print(type(items_xarray))
# >> <class 'xarray.core.dataset.Dataset'>
# See https://docs.xarray.dev/en/stable/api.html#dataset

# At this point, the metadata and array shape are set, but the data itself isn't read.
# Running .compute() allows Dask to optimize the workflow, evaluating and executing it
# in the most efficient way, optimizing resource usage.
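odc.stac.load() returns a Dataset, where each band is a named data variable. As a sketch of what working with such a Dataset looks like (synthetic classification values, not real landcover data):

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the 'classification' variable: 2 time steps, 4x4 grid
classes = np.array([[[1, 1, 2, 2],
                     [1, 2, 2, 2],
                     [3, 3, 2, 1],
                     [3, 3, 1, 1]]] * 2)
ds = xr.Dataset({"classification": (("time", "y", "x"), classes)})

# Count the pixels in class 2 for each time step
class2 = (ds["classification"] == 2).sum(dim=("y", "x"))
print(class2.values)  # [6 6]
```

On the real, Dask-backed Dataset the same expression would stay lazy until `.compute()` is called.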
# Example: Get the mean surface height for the AOI

# 1. Subtract the dtm from the dsm
# Dimensions are [time, band, y, x], and the bands are ['dsm', 'dtm']
surface_height = items_xarray.sel(band='dsm') - items_xarray.sel(band='dtm')
# .sel(band='dsm') selects the 'dsm' element along the band dimension,
# keeping every element of the time, y and x dimensions

# 2. Compute the calculation; before that, only the workflow was defined
surface_height = surface_height.compute()

# 3. Perform the mean
surface_height.mean()