Serve Data from S3 using Hyrax and s3fs

Step-by-Step Guide

  • Turn off the s3fs cache when you mount. (See "The Effect of s3fs Caching" below: Hyrax does not work when the cache is on.)

      # ec2-user is uid 1000 on default Amazon Linux.
      $ s3fs sdt-data -o allow_other -o uid=1000 -o mp_umask=022 -o multireq_max=5 /home/ec2-user/hyrax/build/share/hyrax/data/hdf4

      # The following example mounts an MRF bucket for GeoServer.
      $ s3fs ceres-mrf -o allow_other -o uid=1000 -o gid=1000 -o mp_umask=022 -o multireq_max=5 /home/ec2-user/geoserver-2.15.2/data_dir/data/ceres-mrf -o use_cache=/tmp

      # You can also specify a path after the bucket name. On CentOS, uid:gid=1001 is centos:centos.
      $ s3fs sdt-data:/ceres/SYN1deg-1Hour/Terra-Aqua-MODIS_Edition4A/2017/01 -o allow_other -o uid=1001 -o gid=1001 -o mp_umask=022 -o multireq_max=5 /usr/share/hyrax/data/hdf4

      # Use the cache (-o use_cache) to boost performance, but note that Hyrax fails when the cache is on (see the caching test below).
      $ s3fs sdt-data:/ceres/SYN1deg-1Hour/Terra-Aqua-MODIS_Edition4A/2017/01 -o allow_other -o uid=1001 -o gid=1001 -o mp_umask=022 -o multireq_max=5 /usr/share/hyrax/data/hdf4 -o use_cache=/tmp

  • Turn off the MDS (Metadata Store) in Hyrax; see the configuration sketch after this list.
  • Give the right permissions on the HDF files:

      $ chmod go+r /usr/share/hyrax/data/hdf4/*.hdf
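
For the MDS step, here is a minimal sketch, assuming the Metadata Store is controlled by the DAP.GlobalMetadataStore.path key and that site overrides live in /etc/bes/site.conf (both the key name and the file location are assumptions; check the bes.conf shipped with your Hyrax install):

      # /etc/bes/site.conf (assumed location for an RPM install).
      # An empty (or commented-out) path is assumed to disable the MDS.
      DAP.GlobalMetadataStore.path =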

Running make clean in ~/hyrax/build will free up some disk space; using CentOS 7 with an RPM installation saves even more. NcML on S3 works as long as file permissions are set correctly, which means that you can mount different buckets; a minimal NcML sketch follows.
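
This is a rough sketch of an NcML file wrapping one of the S3-mounted granules above, assuming the Hyrax NcML handler resolves the location attribute against the BES data root (that resolution rule, and the exact path, are assumptions; adjust for your layout):

      <?xml version="1.0" encoding="UTF-8"?>
      <!-- Wraps one S3-mounted CERES HDF4 granule; attributes or aggregations would go inside. -->
      <netcdf xmlns="http://www.unidata.ucar.edu/namespaces/netcdf/ncml-2.2"
              location="/data/hdf4/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf">
      </netcdf>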


Performance Test

Test Setting
  • t2.micro: 1 CPU, 1 GB memory
  • t2.2xlarge: 8 CPUs, 32 GB memory
  • Hyrax 1.15.4 / CentOS 7 x86_64
  • AWS region: us-east-1

    The Effect of Load Balancer and Auto Scaling

    We put Hyrax behind a load balancer with a minimum of 1 and a maximum of 5 instances. When 10 CERES granules (1/22/2017 ~ 1/31/2017) were processed simultaneously, only 2 (1/23 and 1/31) succeeded, and CMR-to-VRT generation succeeded for only 1 (1/23). Some of the errors are gateway errors caused by the short default timeout of the AWS load balancer, which is 60 seconds. If you increase it to 900 seconds to match the Hyrax bes.conf timeout, the errors you still see mostly come from the Hyrax server itself.
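
    As a sketch of the timeout fix, assuming an Application Load Balancer managed with boto3 (the ARN below is a placeholder):

      # Raise the load balancer idle timeout from the 60-second default to
      # 900 seconds so it matches the Hyrax bes.conf timeout.
      import boto3

      elbv2 = boto3.client("elbv2", region_name="us-east-1")
      elbv2.modify_load_balancer_attributes(
          LoadBalancerArn="arn:aws:elasticloadbalancing:us-east-1:123456789012:loadbalancer/app/hyrax/0123456789abcdef",  # placeholder ARN
          Attributes=[{"Key": "idle_timeout.timeout_seconds", "Value": "900"}],
      )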


    The Effect of s3fs Caching

    If you turn on caching in the s3fs mount options (-o use_cache), Hyrax simply does not work, and requests fail with the following message:

    context: Error { code = 500; message = "HDF4 SDstart error for the file /usr/share/hyrax/data/hdf4/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170120.hdf. It is very possible that this file is not an HDF4 file.";}^

    The Effect of Vertical Scaling

    If you increase the capacity of your instance, the speed-up is easy to notice. The following test slices a CERES granule by making 24 requests through the netCDF API.
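
    The full test script is not shown here, but this minimal sketch reconstructs what test_hyrax.py does, using the URL pattern from the traceback below and assuming the sliced variable has a 24-step time dimension as its first axis (the variable name is a placeholder):

      # Time 24 hourly slice reads of one CERES granule served by Hyrax.
      import time
      from netCDF4 import Dataset

      url = ("http://54.164.64.119:8080/opendap/data/ncml_s3/"
             "CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170120.hdf.ncml")

      start = time.time()
      dataset = Dataset(url)                    # netCDF4 opens OPeNDAP URLs via its DAP client
      var = dataset.variables["some_variable"]  # placeholder; use a real CERES variable name
      for hour in range(24):                    # each slice read triggers one DAP request
          _ = var[hour, ...]
      dataset.close()
      print("24 slices in %.2f seconds" % (time.time() - start))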

    1 Granule 24 Slices

    Instance     Time in Seconds (Minutes)   Cost
    t2.micro     2998.74 (49)                $0.01 per hour
    t2.2xlarge   399.21 (6)                  $0.47 per hour

    5 Granules 24 Slices

    Instance     Time in Seconds (Minutes)
    t2.micro
    t2.2xlarge   1924.73 (32)

    t2.2xlarge failed after processing 4 granules from 1/20 to 1/23:

    http://54.164.64.119:8080/opendap/data/ncml_s3/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf.ncml
    syntax error, unexpected $end, expecting ';'
    context: Error { code = 500; message = "HDF4 SDstart error for the file /usr/share/hyrax/data/hdf4/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf. It is very possible that this file is not an HDF4 file.";}^

    Traceback (most recent call last):
      File "test_hyrax.py", line 21, in <module>
        dataset = Dataset(url)
      File "netCDF4/_netCDF4.pyx", line 2135, in netCDF4._netCDF4.Dataset.__init__
      File "netCDF4/_netCDF4.pyx", line 1752, in netCDF4._netCDF4._ensure_nc_success
    OSError: [Errno -70] NetCDF: DAP server error: b'http://54.164.64.119:8080/opendap/data/ncml_s3/CER_SYN1deg-1Hour_Terra-Aqua-MODIS_Edition4A_406406.20170124.hdf.ncml'

    The above error is caused by file corruption.


    The Effect of Private Network

    Time is measured against a t2.micro Hyrax server. A faster network within the same region does not help if the server itself is the bottleneck.

    Network    Time in Seconds (Minutes)   Note
    Internet   2998.74 (49)                Mac OS X client at The HDF Group
    Private    3476.60 (58)                t2.micro client in the same region

    The Effect of Elastic File System

    1 Granule 24 Slices

    The test was done on a t2.micro instance.

    EFS   Time in Seconds (Minutes)   Cost
    No    2998.74 (49)                $0.01 per hour
    Yes   33.92 (0.56)                Standard storage: $0.30/GB per month; throughput: $6 per MB/s-month