January 25, 2019

Limitations of traditional EO data processing workflows

  • Requires data to be present locally
  • Bottleneck for processing large data volumes
    • Bandwidth
    • Computing power
  • openEO aims to develop an alternate workflow supporting cloud processing
    • Moving code to the data is easier for big EO data
    • Clients send processing requests to cloud back-ends over HTTP using openEO API

User-defined Functions (UDFs)

UDFs allow users to run their own custom code on EO data, instead of being limited to the processes natively available in the cloud back-ends.

Objectives

With this background, it's easy to see:

  • A UDF service seems, in principle, to be much-needed infrastructure
  • However, is it actually useful for the geospatial community?
    • Advantages over processing locally?
    • Advantages over other similar cloud-based EO data processing infrastructures?

Research Questions

  1. Is it possible to develop an infrastructure that enables users to run their custom R functions on EO data in the cloud?
  2. Is a UDF service in the context of the openEO API actually useful for the geospatial community to perform EO data analysis?

Strategies for implementing UDFs in R

  • The R package openEO.R.UDF (Ghosh and Lahn, 2019) implements 4 strategies as a proof-of-concept UDF service.
  • All the strategies convert the EO data to a stars object (Pebesma, 2018). This contributes towards the first research question.

Experiments to determine usefulness

Quantitative and qualitative experiments were performed to determine the usefulness from 3 partially overlapping perspectives:

  • Usability: Processing time, volume of data transferred, ease of implementing operations
  • Functionality: Types of possible operations
  • Scalability: Room for accommodating big EO data processing, possibility for a database of UDFs

Data

Experiments were performed on Sentinel-2 images: 300 x 300 px spatial subsets with 3 time steps (temporal resolution of ~10 days), each containing 13 bands
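To give a feel for the size of this test data set, the following minimal sketch works out how many pixel values one subset contains. The dimensions come from the description above; the storage size of 2 bytes per value is an assumption made for illustration and is not stated in the slides.

```python
# Dimensions taken from the experiment description above.
width_px, height_px = 300, 300
time_steps = 3
bands = 13

# Assumption (not from the slides): each pixel value is stored as a 16-bit integer.
BYTES_PER_VALUE = 2

# Total number of pixel values across space, time and bands.
n_values = width_px * height_px * time_steps * bands

# Uncompressed size under the assumed storage type.
raw_bytes = n_values * BYTES_PER_VALUE

print(f"pixel values: {n_values:,}")
print(f"raw size: {raw_bytes / 1e6:.2f} MB (assuming {BYTES_PER_VALUE} bytes per value)")
```

Text encodings such as JSON arrays or base64 strings would typically be larger than this raw figure, which is relevant to the "volume of data transferred" criterion.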

Infrastructure

  • Server deployed at http://jstatsoft.uni-muenster.de running Docker containers of the R package openEO.R.UDF (Ghosh and Lahn, 2019) implementing the strategies
  • File-based service integrated with the R reference back-end: openeo-r-backend (Lahn and Ghosh, 2018)
  • REST end-points implemented in R using plumber (Trestle Technology, 2018)

Results: file-based vs. JSON arrays (list)

  • File-based: UDF service as a part of the back-end receiving TIFFs exposed as lists
  • Web-based: RESTful web-service with JSON arrays representing pixel values exposed as lists

- Functionally, these are suitable for simple reducing operations only
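As an illustration of what a "simple reducing operation" on list-shaped data can look like, here is a minimal sketch that collapses the time dimension of a small cube. It is written in Python purely for illustration and is not the code of the R service; the nesting order `cube[t][row][col]` and the function name `reduce_time` are assumptions.

```python
from statistics import mean

def reduce_time(cube, reducer=mean):
    """Collapse the time dimension of cube[t][row][col] using `reducer`.

    Hypothetical helper: the layout of the nested lists is an assumption.
    """
    n_rows = len(cube[0])
    n_cols = len(cube[0][0])
    # For every pixel position, gather its values across all time steps
    # and reduce them to a single value.
    return [
        [reducer([layer[r][c] for layer in cube]) for c in range(n_cols)]
        for r in range(n_rows)
    ]

# Three time steps of a 2 x 2 single-band raster (made-up values).
cube = [
    [[1, 2], [3, 4]],
    [[3, 4], [5, 6]],
    [[5, 6], [7, 8]],
]

composite = reduce_time(cube)          # per-pixel temporal mean
maximum = reduce_time(cube, max)       # per-pixel temporal maximum
print(composite)
print(maximum)
```

Operations that need spatial context or dimension metadata are harder to express on plain lists, which is consistent with the limitation noted above.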

Results: JSON arrays and base64 encoded strings

  • JSON arrays: Pixel values are represented as JSON arrays exposed as a stars object
  • Base64 encoded string: GeoTIFF files in a ZIP file represented as a base64 encoded string

- Functionally, these could handle complex operations
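The two payload styles can be sketched with generic tooling. The example below is a Python illustration, not the service's actual request format: the JSON key `"data"`, the file names, and the placeholder bytes standing in for GeoTIFF content are all hypothetical.

```python
import base64
import io
import json
import zipfile

def encode_as_json_array(pixels):
    """Serialise nested lists of pixel values as a JSON document.

    The key "data" is a hypothetical name chosen for this sketch.
    """
    return json.dumps({"data": pixels})

def encode_files_as_base64_zip(files):
    """Pack {name: bytes} into an in-memory ZIP and return it as base64 text."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
    return base64.b64encode(buffer.getvalue()).decode("ascii")

def decode_base64_zip(encoded):
    """Reverse of encode_files_as_base64_zip: return {name: bytes}."""
    raw = base64.b64decode(encoded)
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return {name: archive.read(name) for name in archive.namelist()}

# Placeholder bytes stand in for real GeoTIFF files.
files = {"t1.tif": b"\x00\x01" * 8, "t2.tif": b"\x02\x03" * 8}

encoded = encode_files_as_base64_zip(files)
restored = decode_base64_zip(encoded)
json_payload = encode_as_json_array([[1, 2], [3, 4]])

print(restored == files)
print(json_payload)
```

The base64 route keeps the files byte-for-byte intact, so any georeferencing stored inside them survives the transfer, whereas a bare JSON array carries only the pixel values unless metadata is added alongside.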

Results: Qualitative aspects

Usability

  • Assumes no special environment: the user writes R code as on a local machine, except that
    • the script ends with an anonymous R function definition that takes the incoming stars object as its sole argument and acts as a wrapper
    • this anonymous function returns a stars object with proper attributes
  • No need for the user to keep track of data's dimensions
  • Currently only supports raster data - the input and the output of the UDF service must be rasters

Functionality

  • Users may install and load any R package in the UDF script (e.g. using install.packages(), devtools::install_github(), library())
    • This gives the user a lot of freedom
  • Possible to download additional data from any online source (implemented) and upload additional files accessible to the UDF (not implemented)

Comparison with similar infrastructures

A comparative study with Google Earth Engine (GEE) is yet to be completed. Preliminary results:

Advantages

  • Exposes not only an interface, but offers full control over it. GEE offers a JS sandbox and a Python client
  • Ability to load any R package and data implies a large number of possibilities
    • e.g. running C++ code
    • GEE offers custom data upload but functionally does not offer such flexibility (?)
  • RESTful implementation uses generic formats and technologies: GeoTIFFs, base64 encoding and JSON

Disadvantages

  • The current implementation is not scalable. GEE is designed to be scalable and hence supports larger data volumes
    • Scalability is better implemented by the back-ends (e.g. chunking data before sending it to the UDF service)
  • No native support of data other than rasters (e.g. CSVs, vector data etc.)
  • No interactive experience for the user (yet)
    • But possible in light of new developments in stars
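The idea of a back-end chunking data before handing it to the UDF service, mentioned under scalability above, can be sketched as follows. This is a hypothetical Python illustration and not part of the current implementation; the function name `tile_windows` and the tile size of 128 px are assumptions.

```python
def tile_windows(width, height, tile_size):
    """Yield (col_off, row_off, tile_width, tile_height) windows covering a raster.

    Hypothetical helper: edge tiles are clipped to the raster extent.
    """
    for row_off in range(0, height, tile_size):
        for col_off in range(0, width, tile_size):
            yield (
                col_off,
                row_off,
                min(tile_size, width - col_off),
                min(tile_size, height - row_off),
            )

# Split a 300 x 300 px raster (the size used in the experiments) into tiles.
windows = list(tile_windows(300, 300, tile_size=128))

# Each window could then be sent to the UDF service as a separate request.
for window in windows:
    print(window)
```

Tiling of this kind only suits operations that treat pixels independently or reduce over time or bands; operations that need neighbouring pixels would require overlapping windows.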

Conclusion

  • The R UDF service demonstrates that it is possible to develop infrastructure to run users' custom R functions on EO data in the cloud

  • Practically useful
    • Constraints on users minimal
    • Offers a flexible platform with no special pre-requisite knowledge
    • Current implementation offers reasonably fast processing (room for improvement with infrastructure for scalability)
  • Comparatively
    • A better alternative to processing data locally (bandwidth, computing power etc.)
    • From some perspectives it is better than similar infrastructures (e.g. GEE); requires more work to offer all possibilities (not an objective!)

Thank you!

Useful Links

References