
Additional Libraries

The following examples focus on loading COG data into an xarray object backed by Dask for efficient in-memory reading and processing.

Info

Xarray is built on NumPy and Pandas, adding capabilities for labeled, multi-dimensional arrays (e.g., climate data, satellite images). It extends NumPy arrays by attaching metadata (coordinates, labels), making it easier to work with dimensions such as time, latitude, longitude, and other variables.

Xarray can use Dask arrays for lazy evaluation, enabling work with large datasets that don't fit in memory. Dask optimizes workflows by parallelizing tasks, reading data in chunks, and improving performance and memory efficiency.

Source : Xarray: Parallel Computing with Dask
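As a minimal sketch (synthetic data, unrelated to the examples below; requires NumPy, Xarray and Dask installed), a chunked, Dask-backed DataArray only executes its workflow when .compute() is called:

import numpy as np
import xarray as xr

# Synthetic, chunked DataArray (illustration only)
arr = xr.DataArray(
    np.random.rand(4, 1000, 1000),
    dims=("time", "y", "x"),
).chunk({"time": 1, "y": 500, "x": 500})

# This only builds a Dask task graph, nothing is computed yet
lazy_mean = arr.mean(dim="time")

# Trigger the parallel computation and load the result into memory
result = lazy_mean.compute()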

Note

The following examples start with a request to the CCMEO STAC API via the pystac-client library.

For more details on how to discover data through the STAC API, see the Interacting with CCMEO STAC API section.

Using Rioxarray

Load a single COG into an xarray object, using the link to the S3 object.

Rioxarray is based on rasterio and can be used to read data into an Xarray object. The developers of the rioxarray library provide additional usage examples, like this one.

pip install rioxarray
# Dask is needed to be able to use the rioxarray chunks 
# https://docs.dask.org/en/stable/install.html
pip install "dask[complete]" 
Source : rioxarray installation

Read a portion of a remote COG into an xarray

In this code you will:

  • Query a STAC API with pystac-client to get a link to a COG;
  • Use rioxarray to read the header of the remote COG;
  • Open the remote COG by chunks;
  • Read an area of interest into an in-memory Xarray object.

Info

This specific example uses the collection mrdem-30 from CCMEO's datacube.

Warning

By default, the bounding box passed to rio.clip_box() must be in the raster's own coordinate reference system; pass the crs argument to clip with a bounding box expressed in another CRS.
See https://corteva.github.io/rioxarray/stable/examples/clip_box.html#Clip-using-a-bounding-box
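A short sketch of the two call patterns (minx, miny, maxx, maxy are placeholder values):

# Bounding box already in the raster's own CRS (default behaviour)
clipped = band.rio.clip_box(minx, miny, maxx, maxy)

# Bounding box in another CRS, e.g. EPSG:4326, passed explicitly
clipped = band.rio.clip_box(minx, miny, maxx, maxy, crs="EPSG:4326")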

Note

When using rioxarray.open_rasterio(), set chunks to enable lazy loading with Dask. This allows Dask to read data in smaller chunks, improving speed and memory usage through parallel computing. For example, a chunk size of 1000 for both x and y means Dask reads 1000x1000 pixel blocks instead of the entire array and can process multiple chunks in parallel.
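The chunks argument can also be given per dimension as a dict; for example (cog_url is a placeholder for the asset href used below):

band = rioxarray.open_rasterio(
    cog_url,  # placeholder for the COG href
    chunks={"band": 1, "x": 1000, "y": 1000},  # 1000x1000 pixel Dask chunks
)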

import pystac_client
import rioxarray

bbox=[-75.8860,45.3157,-75.5261,45.5142] 
bbox_crs = "EPSG:4326"

# Link to ccmeo datacube stac-api
stac_root = "https://datacube.services.geo.ca/stac/api"
# Initialize the STAC client
catalog = pystac_client.Client.open(stac_root)

search = catalog.search(
    collections=['mrdem-30'], 
    bbox=bbox,
    ) 

# Use the rioxarray.open_rasterio() and clip it to the bbox
for page in search.pages():
    for item in page:
        band = rioxarray.open_rasterio(
            item.assets['dtm'].href, 
            chunks=512,
            ).rio.clip_box(*bbox,crs=bbox_crs)

# At this point the data is not read in the Xarray object
# See the Xarray object details
print(band)
# <xarray.DataArray (band: 1, y: 999, x: 1134)> Size: 5MB
# dask.array<getitem, shape=(1, 999, 1134), dtype=float32, chunksize=(1, 512, 512), chunktype=numpy.ndarray>
# Coordinates:
#   * band         (band) int64 8B 1
#   * x            (x) float64 9kB 1.493e+06 1.493e+06 ... 1.527e+06 1.527e+06
#   * y            (y) float64 8kB -1.557e+05 -1.558e+05 ... -1.857e+05 -1.857e+05
#     spatial_ref  int64 8B 0
# Attributes:
#     TIFFTAG_DATETIME:  2024:12:12 12:00:00
#     AREA_OR_POINT:     Area
#     scale_factor:      1.0
#     add_offset:        0.0
#     _FillValue:        -32767.0

# At this point, the metadata and array shape are set, but the data itself isn't read.
# Running .compute() allows Dask to optimize the workflow, evaluating and executing it 
# in the most efficient way, optimizing resource usage.

# Perform analysis...

# To read the data into memory
band = band.compute()
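# Example (illustrative): mean elevation over the clipped AOI
# mean() is lazy on Dask-backed data; .compute() returns the value
mean_elevation = band.mean().compute()
print(float(mean_elevation))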
See the Working with Xarray object or Community Notebook complete examples sections for an example of using an Xarray object.

From a pystac item object to an Xarray object.

See pystac.Item or pystac-client for more information.

Warning

The following libraries are not part of the STAC utils ecosystem. Be aware of the maintenance status of these libraries.

Warning

GDAL's GetGeoTransform and rasterio use different orderings for the affine transform metadata. When using a GDAL-based method, you need to re-order the transform. See Re-order the STAC proj:transform for more details.

def reorder_transform(gdal_transform):
    """
    Reorders the GDAL GeoTransform (6-element tuple) into the 9-element format
    that is compatible with proj:transform.
    """
    return [gdal_transform[1], gdal_transform[2], gdal_transform[0],
            gdal_transform[4], gdal_transform[5], gdal_transform[3],
            0, 0, 1]

For more information, please see the STAC documentation on proj:transform.
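As a quick check, applying reorder_transform() to a made-up GDAL GeoTransform (30 m pixels, arbitrary origin):

# Hypothetical GDAL GeoTransform:
# (x_origin, pixel_width, row_rotation, y_origin, col_rotation, pixel_height)
gdal_gt = (1500000.0, 30.0, 0.0, -150000.0, 0.0, -30.0)

print(reorder_transform(gdal_gt))
# >> [30.0, 0.0, 1500000.0, 0.0, -30.0, -150000.0, 0, 0, 1]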

Using stackstac

This is a third-party library based on Xarray, but it is not listed in the Xarray documentation.

pip install "stackstac[viz]"
If you have problems with the installation, please refer to stackstac installation.

Load pystac items into an Xarray

In this code you will:

  • Query a STAC API with pystac-client to get pystac item objects;
  • Load the pystac item objects into an Xarray.DataArray using stackstac;
  • Stream the actual data into memory with the Dask workflow included in stackstac.

Info

This specific example uses the collection hrdem-mosaic-1m from CCMEO's datacube.

import pystac_client
import stackstac

# Define a bounding box for an AOI in EPSG:4326
bbox=[-71.2155,45.4012,-71.1079, 45.5083]
bbox_crs = "EPSG:4326"

# Link to ccmeo datacube stac-api
stac_root = "https://datacube.services.geo.ca/stac/api"
catalog = pystac_client.Client.open(stac_root)

search = catalog.search(
    collections=['hrdem-mosaic-1m'], 
    bbox=bbox,) 

# Re-order the proj:transform
result_items = []
# Use the pagination to improve efficiency
# Iterate over the returned pages and update the transform for each item
for page in search.pages():
    for item in page:
        # The reorder_transform() function is defined above
        reordered_transform = reorder_transform(
                                item.properties["proj:transform"]
                                )
        item.properties["proj:transform"] = reordered_transform
        result_items.append(item)

# Importing unnecessary assets may cause memory and speed issues.
# To know the assets available in this collection:
print(result_items[0].assets.keys())
# >> dict_keys(['dsm', 'dtm', 'dsm-vrt', 'dtm-vrt', 'thumbnail', 
#               'hillshade-dsm', 'hillshade-dtm', 'extent',
#               'coverage',])

items_xarray = stackstac.stack(result_items, 
                          assets = ["dsm", "dtm"], 
                          bounds_latlon = bbox, 
                          chunksize = (1000, 1000),
                          epsg = 3979,
                          properties=False)

print(items_xarray)
# >>> items_xarray 
# <xarray.DataArray 'stackstac-a96f69db1a66a462b3fb7c84c412aa7e' (time: 1,
#                                                                 band: 2,
#                                                                 y: 8812, 
#                                                                 x: 7581)> Size: 1GB
# dask.array<fetch_raster_window, shape=(1, 2, 8812, 7581), dtype=float64, chunksize=(1, 1, 1000, 1000), chunktype=numpy.ndarray>
# Coordinates:
#   * time         (time) datetime64[ns] 8B 2023-05-09T12:00:00
#     id           (time) <U13 52B '9_2-mosaic-1m'
#   * band         (band) <U3 24B 'dsm' 'dtm'
#   * x            (x) float64 61kB 1.843e+06 1.843e+06 ... 1.855e+06 1.855e+06
#   * y            (y) float64 70kB -3.925e+04 -3.925e+04 ... -5.362e+04
#     title        (band) <U27 216B 'Digital Surface Model (COG)' 'Digital Terr...
#     description  (band) <U61 488B 'Digital Surface Model derived fromAirborne...
#     epsg         int64 8B 3979
# Attributes:
#     spec:           RasterSpec(epsg=3979, bounds=(1842906.2779342758, -53625....
#     crs:            epsg:3979
#     transform:      | 1.63, 0.00, 1842906.28|\n| 0.00,-1.63,-39249.98|\n| 0.0...
#     resolution_xy:  (1.6327081635433747, 1.6314054662236352)

# At this point, the metadata and array shape are set, but the data itself isn't read.
# Running .compute() allows Dask to optimize the workflow, evaluating and executing it 
# in the most efficient way, optimizing resource usage.
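# Example (illustrative): select a single band before computing to limit memory use
dtm = items_xarray.sel(band="dtm").compute()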

See the Working with Xarray object or Community Notebook complete examples sections for an example of using an Xarray object.

Using odc-stac

This is a third-party library based on Xarray, but it is not listed in the Xarray documentation.

pip install odc-stac
If you have problems with the installation, please refer to odc-stac installation.

Load pystac items into an Xarray

In this code you will:

  • Query a STAC API with pystac-client to get pystac item objects;
  • Load the pystac item objects into an Xarray.Dataset using odc.stac.

Info

This specific example uses the collection landcover from CCMEO's datacube.

import pystac_client
import odc.stac

# Define a bounding box for an AOI (Ottawa) in EPSG:4326
bbox=[-75.8860,45.3157,-75.5261,45.5142]
bbox_crs = "EPSG:4326"
# The landcover collection has a meaningful temporal dimension
# In this example, we filter from 2015 to 2020
date_range = '2015-01-01/2020-12-31'

# Link to ccmeo datacube stac-api
stac_root = "https://datacube.services.geo.ca/stac/api"
catalog = pystac_client.Client.open(stac_root)

search = catalog.search(
    collections=['landcover'], 
    bbox=bbox,
    datetime=date_range,) 

# Re-order the proj:transform
result_items = []
# Use the pagination to improve efficiency
# Iterate over the returned pages and update the transform for each item
for page in search.pages():
    for item in page:
        # The reorder_transform() function is defined above
        reordered_transform = reorder_transform(
                                item.properties["proj:transform"]
                                )
        item.properties["proj:transform"] = reordered_transform
        result_items.append(item)

# Importing unnecessary assets may cause memory and speed issues.
# To know the assets available in this collection:
print(result_items[0].assets.keys())
# >> dict_keys(['classification', 'thumbnail'])

items_xarray = odc.stac.load(result_items,
                     chunks = {"x": 512, "y": 512},
                     bands = ["classification"], # Asset to load
                     bbox = bbox,)

print(type(items_xarray)) 
# >> <class 'xarray.core.dataset.Dataset'>
# See https://docs.xarray.dev/en/stable/api.html#dataset

# At this point, the metadata and array shape are set, but the data itself isn't read.
# Running .compute() allows Dask to optimize the workflow, evaluating and executing it 
# in the most efficient way, optimizing resource usage.
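# Example (illustrative): read the most recent landcover classification into memory
latest = items_xarray.classification.isel(time=-1).compute()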
See the Working with Xarray object or Community Notebook complete examples sections for an example of using an Xarray object.

Working with Xarray object

See Xarray.DataArray for details on methods: https://docs.xarray.dev/en/stable/generated/xarray.DataArray.html

# Example: Get the mean surface height for the AOI
# 1. Subtract the dtm band from the dsm band
# Dimensions are [time, band, y, x] and the bands are ['dsm', 'dtm']
surface_height = items_xarray.sel(band="dsm") - items_xarray.sel(band="dtm")
# 2. Trigger the computation; until this point only the Dask workflow was defined
surface_height = surface_height.compute()
# 3. Compute the mean
surface_height.mean()

Community Notebook complete examples