<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[William Lee]]></title><description><![CDATA[Good Vibes - Senior Full Stack Web Developer]]></description><link>https://william-lee.com/</link><image><url>https://william-lee.com/favicon.png</url><title>William Lee</title><link>https://william-lee.com/</link></image><generator>Ghost 5.62</generator><lastBuildDate>Fri, 10 Apr 2026 11:58:34 GMT</lastBuildDate><atom:link href="https://william-lee.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[You only push once - Cross Region Replication for AWS ECR]]></title><description><![CDATA[Why setting up Cross Region Replication in AWS ECR is worth exploring, with examples of common Docker -> ECR patterns.]]></description><link>https://william-lee.com/you-only-push-once-ecr-crr/</link><guid isPermaLink="false">66a9bde481fde50137f0f1fd</guid><category><![CDATA[DevOps]]></category><category><![CDATA[DevEx]]></category><category><![CDATA[ECR]]></category><category><![CDATA[AWS]]></category><dc:creator><![CDATA[William Lee]]></dc:creator><pubDate>Wed, 31 Jul 2024 06:00:00 GMT</pubDate><content:encoded><![CDATA[<p>It&apos;s a common pattern where you need to push your docker image and make it readily available to multiple servers in multiple regions.</p>
<h2 id="patterns">Patterns</h2>
<p>Over time I&apos;ve seen the following patterns take shape:</p>
<h3 id="pattern-1-single-imagemultiple-ecr-per-region">Pattern 1: Single Image/Multiple ECR per Region</h3>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://william-lee.com/content/images/2024/07/image-3.png" class="kg-image" alt loading="lazy" width="657" height="675" srcset="https://william-lee.com/content/images/size/w600/2024/07/image-3.png 600w, https://william-lee.com/content/images/2024/07/image-3.png 657w"><figcaption><span style="white-space: pre-wrap;">Illustration of docker image pushed to each region&apos;s ECR - which is read by servers</span></figcaption></figure>
<p>Pros: </p>
<ul><li>Fast instance creation times - each server instance has <strong>fast</strong> network access to its same-region ECR.</li><li>Explicit architecture.</li></ul>
<p>Cons:</p>
<ul><li>DevOps maintenance<ul><li>Must write deployment code that pushes to each region.</li><li>More moving parts/mental load.</li><li>More error prone - you need to retry and babysit the deploy if the push to one region is disrupted.</li></ul></li><li>Cost - per-region ECR billing (around 1 USD a month per ECR as of writing).</li></ul>
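<p>As a rough sketch, Pattern 1 often ends up as a fan-out in CI - one push job per region. The workflow below is illustrative only (the region list, role ARN, and <code>my-app</code> image name are placeholders, not values from this post):</p>
<pre><code class="language-yaml">jobs:
  push-image:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        region: [ap-southeast-2, eu-west-2, us-east-1]
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-region: ${{ matrix.region }}
          role-to-assume: arn:aws:iam::123456789012:role/deploy # placeholder
      - id: login
        uses: aws-actions/amazon-ecr-login@v2
      - run: |
          # Each region needs its own tag + push
          docker tag my-app:latest ${{ steps.login.outputs.registry }}/my-app:latest
          docker push ${{ steps.login.outputs.registry }}/my-app:latest</code></pre>
<p>Every extra region is another job that can fail independently - which is exactly the maintenance burden described above.</p>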
<h3 id="pattern-2-single-imagesingle-region-ecr-source">Pattern 2: Single Image/Single Region ECR Source</h3>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://william-lee.com/content/images/2024/07/image-5.png" class="kg-image" alt loading="lazy" width="657" height="675" srcset="https://william-lee.com/content/images/size/w600/2024/07/image-5.png 600w, https://william-lee.com/content/images/2024/07/image-5.png 657w"><figcaption><span style="white-space: pre-wrap;">Illustration of docker image being pushed to a single Region ECR and other regions referencing the image</span></figcaption></figure>
<p>Pros:</p>
<ul><li>Simpler DevOps/Mental load: Docker image is only pushed to a single ECR.</li><li>Cheaper: Single ECR to manage.</li></ul>
<p>Cons:</p>
<ul><li>Slower instance creation times: geographical distance slows image pulls, so instances take longer to create/spin up. This is most evident when autoscaling latency is crucial.</li></ul>
<h3 id="conclusion">Conclusion</h3>
<p><strong>Pattern 1</strong> has the advantage of quick instance creation from a docker definition due to geographical redundancy, but has too many moving parts from a DevOps experience perspective.</p>
<p><strong>Pattern 2</strong> is simple to grok from a DevOps perspective. However, the lack of geographical redundancy might be a deal breaker - imagine tasks in the UK region needing to scale up quickly, but having to fetch the docker definition from the source region in Australia.</p>
<h2 id="single-ecrcrrthe-best-of-all-patterns">Single ECR/CRR - The best of all patterns</h2>
<p>This approach takes the <strong>best of</strong> the solutions we explored above - and leverages <a href="https://aws.amazon.com/blogs/containers/cross-region-replication-in-amazon-ecr-has-landed/?ref=william-lee.com">AWS Cross Region Replication</a> to fill in the cracks.</p>
<p>It&apos;s important to grok that it&apos;s a <a href="https://aws.amazon.com/blogs/containers/cross-region-replication-in-amazon-ecr-has-landed/?ref=william-lee.com">per-region</a> setting (for private and public repositories), so we need to filter ECRs via name.</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://william-lee.com/content/images/2024/07/image-7.png" class="kg-image" alt loading="lazy" width="657" height="675" srcset="https://william-lee.com/content/images/size/w600/2024/07/image-7.png 600w, https://william-lee.com/content/images/2024/07/image-7.png 657w"><figcaption><span style="white-space: pre-wrap;">Illustration of docker image being pushed into a single source ECR - and Cross Region Replication copying it automatically to other regions</span></figcaption></figure>
<p>Pros:</p>
<ul><li>Less IaC/easy to grok (single region/ECR push).</li><li>Free: you just pay for ECR storage costs - replication uses the AWS backbone for network transfer.</li><li>Fast replication: I personally found it takes around 15 seconds for a 500 MB image to be available on the other side of the world.</li><li>Supports cross-account replication<ul><li>Great for disaster recovery.</li><li>Great for multiple environments (e.g. development and production).</li></ul></li><li>No need to manually create corresponding ECRs in other regions - CRR creates them for you (although, as outlined below, if using IaC it&apos;s recommended to create them first before enabling CRR).</li><li>ECS handles the slight delay in replication by re-querying ECR until the image is there.</li></ul>
<p>Cons:</p>
<ul><li>CRR happens behind the scenes, so the originating region must be found in the Terraform code or by inspecting the repository settings in the AWS console.</li></ul>
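<p>To make the per-region nature concrete, here is a minimal sketch of a replication rule with a name-prefix filter in CloudFormation (the destination region, account ID, and <code>my-app</code> prefix are illustrative; Terraform exposes the same setting via <code>aws_ecr_replication_configuration</code>):</p>
<pre><code class="language-yaml"># Applied once, in the *source* region - this is a registry-level setting
Resources:
  Replication:
    Type: AWS::ECR::ReplicationConfiguration
    Properties:
      ReplicationConfiguration:
        Rules:
          - Destinations:
              - Region: eu-west-2
                RegistryId: &quot;123456789012&quot; # same account here; another account ID enables cross-account DR
            RepositoryFilters:
              - Filter: my-app
                FilterType: PREFIX_MATCH # only repositories whose names start with my-app are replicated</code></pre>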
<h3 id="iac-gotchas">IaC gotchas</h3>
<ul><li>It&apos;s better to explicitly create the ECRs in each required region first and then apply CRR.<ul><li>Otherwise IaC may become confused and try to create ECRs in the other regions for you when you reference them again (e.g. referencing them in deployment code).</li></ul></li><li>Remember that CRR is a <strong>per-region</strong> setting - and you should specify CRR infrastructure code in one place.<ul><li>I was caught out when we had CRR IaC code in two repositories. Each repository&apos;s deployments overrode the CRR settings of the other.</li></ul></li></ul>
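<p>A minimal sketch of the fix for the first gotcha - declare the repository explicitly in each destination region&apos;s stack before enabling CRR, so IaC owns it rather than later discovering a repository that replication silently created (the <code>my-app</code> name is illustrative):</p>
<pre><code class="language-yaml"># Deployed to every destination region before the replication rule is applied
Resources:
  AppRepository:
    Type: AWS::ECR::Repository
    Properties:
      RepositoryName: my-app # must match the name/prefix used by the replication filter</code></pre>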
<h3 id="takeaway-notes">Takeaway notes</h3>
<p>It&apos;s definitely worth exploring CRR!</p>
<ul><li>We found our build times were cut by 20 (billable) minutes - in our case this included an image build and push for each and every region.</li><li>Improved DevOps experience - we no longer had to painstakingly handle each individual region, or worry about network issues breaking things and requiring a retry. We push once, and the image becomes available in every ECR/region we require.</li><li>Reduced lines of code relating to multiple regions (we only need to reference the source region).</li></ul>]]></content:encoded></item><item><title><![CDATA[Improve Integration Tests With GitHub Action Service Containers]]></title><description><![CDATA[We explore how GitHub Action Services can improve integration tests, have better performance than docker layer caching, cut build times, and are better for parallel tests that run on separate runners.]]></description><link>https://william-lee.com/improve-integration-tests-with-github-action-service-containers/</link><guid isPermaLink="false">64fd72ba7421730107ad3e93</guid><category><![CDATA[GitHub Actions]]></category><category><![CDATA[Cost Saving]]></category><category><![CDATA[DevEx]]></category><category><![CDATA[DevOps]]></category><dc:creator><![CDATA[William Lee]]></dc:creator><pubDate>Mon, 11 Sep 2023 11:45:08 GMT</pubDate><content:encoded><![CDATA[<h2 id="setting-the-scene">Setting The Scene</h2>
<p>Typically, integration test suites require a set of temporary docker services to connect with, such as a relational database or a caching service.</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://william-lee.com/content/images/2023/09/integration-test-deps-1-.svg" class="kg-image" alt="Integration Test Suite/Services Architecture" loading="lazy" width="625" height="505"><figcaption><span style="white-space: pre-wrap;">Integration Test Suite/Services Architecture</span></figcaption></figure>
<p>We tend to set these up via a docker-compose.yml file, so services can be exposed to the integration tests and torn down easily.</p>
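<p>For illustration, a typical docker-compose.yml for such a suite might look like this (the postgres service, tag, and credentials are placeholders):</p>
<pre><code class="language-yaml">services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_PASSWORD: test # throwaway credentials for local test runs only
    ports:
      - &quot;5432:5432&quot;</code></pre>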
<h2 id="how-can-we-improve-on-this">How Can We Improve On This?</h2>
<p>We can explore <a href="https://docs.github.com/en/actions/using-containerized-services/about-service-containers?ref=william-lee.com">GitHub service containers</a>! </p>
<p>They allow us to specify docker images and which ports to expose to the GitHub runner that is running your test suite. The syntax is almost identical to a docker-compose.yml file. The only differences I found were how <code>env</code> variables are declared, and other nuances such as specifying bespoke health check parameters.</p>
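<p>To illustrate how close the two syntaxes are, here is a hypothetical postgres service container (image, credentials, and test command are placeholders) - note the <code>env</code> key where docker-compose uses <code>environment</code>, and health check parameters passed via <code>options</code>:</p>
<pre><code class="language-yaml">jobs:
  integration-tests:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env: # docker-compose calls this key environment
          POSTGRES_PASSWORD: test
        ports:
          - 5432:5432
        options: &gt;-
          --health-cmd &quot;pg_isready&quot;
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    steps:
      - run: npm test # runs once postgres reports healthy</code></pre>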
<h2 id="why-use-github-action-service-containers">Why Use GitHub Action Service Containers?</h2>
<ul><li>No need to write your own health check scripts</li></ul>
<p>GitHub Actions does this for you.</p>
<figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://william-lee.com/content/images/2023/09/health-checks.png" class="kg-image" alt loading="lazy" width="2000" height="188" srcset="https://william-lee.com/content/images/size/w600/2023/09/health-checks.png 600w, https://william-lee.com/content/images/size/w1000/2023/09/health-checks.png 1000w, https://william-lee.com/content/images/size/w1600/2023/09/health-checks.png 1600w, https://william-lee.com/content/images/size/w2400/2023/09/health-checks.png 2400w" sizes="(min-width: 720px) 720px"><figcaption><span style="white-space: pre-wrap;">Screenshot of included health checks</span></figcaption></figure>
<p>Some services, such as Elasticsearch, need extra configuration under <code>env</code>, and bespoke health check parameters can be supplied via <code>options</code>:</p>
<pre><code class="language-yaml">elasticsearch:
  image: docker.elastic.co/elasticsearch/elasticsearch:7.1.1
  ports:
    - 9200:9200
    - 9300:9300
  env:
    discovery.type: single-node
  options: &gt;-
    --health-cmd &quot;curl http://localhost:9200/_cluster/health&quot;
    --health-interval 10s
    --health-timeout 5s
    --health-retries 10</code></pre>
<ul><li> Cost/Time savings - I found they cut build times by around a minute</li></ul>
<p>Before (running <code>docker build</code>):</p>
<figure class="kg-card kg-image-card"><img src="https://william-lee.com/content/images/2023/09/before-time.jpg" class="kg-image" alt loading="lazy" width="2000" height="843" srcset="https://william-lee.com/content/images/size/w600/2023/09/before-time.jpg 600w, https://william-lee.com/content/images/size/w1000/2023/09/before-time.jpg 1000w, https://william-lee.com/content/images/size/w1600/2023/09/before-time.jpg 1600w, https://william-lee.com/content/images/size/w2400/2023/09/before-time.jpg 2400w" sizes="(min-width: 720px) 720px"></figure>
<p>After:</p>
<figure class="kg-card kg-image-card"><img src="https://william-lee.com/content/images/2023/09/CleanShot-2023-09-11-at-16.54.56@2x.png" class="kg-image" alt loading="lazy" width="2000" height="188" srcset="https://william-lee.com/content/images/size/w600/2023/09/CleanShot-2023-09-11-at-16.54.56@2x.png 600w, https://william-lee.com/content/images/size/w1000/2023/09/CleanShot-2023-09-11-at-16.54.56@2x.png 1000w, https://william-lee.com/content/images/size/w1600/2023/09/CleanShot-2023-09-11-at-16.54.56@2x.png 1600w, https://william-lee.com/content/images/size/w2400/2023/09/CleanShot-2023-09-11-at-16.54.56@2x.png 2400w" sizes="(min-width: 720px) 720px"></figure>
<ul><li>Faster than utilizing docker layer caching. </li></ul>
<p>I found using GitHub Actions service containers faster and more efficient than running <code>docker build</code> in a dedicated job and utilizing docker layer caching so that subsequent jobs only need to fetch from the cache.</p>
<p>The reason: reading docker layers from the cache was far slower than the image pull that GitHub Action service containers perform (sidenote: GitHub doesn&apos;t charge for ingress).</p>
<ul><li>Less setup for individual tests.</li></ul>
<p>If you&apos;re running your integration tests in a <a href="https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs?ref=william-lee.com">matrix/parallel</a>, each individual job needs to set up its services individually (added bonus: each job runs on its own runner, so we don&apos;t have to worry about memory - and if we have flakey tests, we can retry just the test that failed).</p>
<p>If setup takes one minute per runner instead of two, the time saved compounds. And because parallel runs are only as slow as the slowest test, we save on total wait time as well.</p>
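<p>A sketch of that setup - each matrix shard gets its own runner and its own fresh copy of the services (the shard count, postgres service, and the <code>--shard</code> flag passed to the test runner are hypothetical):</p>
<pre><code class="language-yaml">jobs:
  integration-tests:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false # lets you re-run just the shard that failed
      matrix:
        shard: [1, 2, 3, 4]
    services:
      postgres: # placeholder service - each shard gets a fresh instance
        image: postgres:15
        env:
          POSTGRES_PASSWORD: test
        ports:
          - 5432:5432
    steps:
      - run: npm test -- --shard=${{ matrix.shard }}/4 # hypothetical sharding flag</code></pre>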
<h2 id="when-you-might-not-want-to-use-them">When You Might Not Want To Use Them:</h2>
<ul><li>Disparate testing processes between your build environment and local development environment </li></ul>
<p>Using a docker-compose.yml is the same experience both locally and in CI. With service containers, we hard-code image versions and tags in our GitHub Action workflow files, so they can become out of date - a potential gotcha.</p>
<p>This could be negated by running your GitHub Actions locally using <a href="https://github.com/nektos/act">nektos/act</a> if you want local verification - but that&apos;s not always desirable.</p>
<ul><li>Two docker versions to keep track of - Local and GitHub Action workflow</li></ul>
<p>Again, this could be negated by running your GitHub Actions locally using <a href="https://github.com/nektos/act">nektos/act</a> - but that&apos;s not always desirable.</p>
<ul><li>Larger/messier GitHub action workflows files.</li></ul>
<p>We can go from one neat command that sets everything up (i.e. a single command invoking all the necessary start-up scripts and health checks) to an inflated number of lines specifying the service definitions.</p>
<h2 id="summary">Summary</h2>
<p>I&apos;ve found GitHub Actions service containers cut service setup time by around 50% in all of my cases. For situations where you need to run your tests in parallel and each task runner has to set up its own services, the pros of using service containers outweigh the cons.</p>
<p>The major cost, in my opinion, is having duplicate docker image definitions locally and in CI that must be manually kept in sync.</p>
<p>We should be striving for faster build times and a better developer experience. Slow integration tests in CI are more common than you think. Running them in parallel means each test runs in its own environment with clear visibility in GitHub - as opposed to all tests running on one runner amid a sprawl of console logs - and lets developers easily retry just the test that failed rather than the entire suite.</p>]]></content:encoded></item></channel></rss>