Test AWS Glue jobs locally using Docker and moto

Sagadevan K
Jan 8, 2023


AWS Architecture. Image by author (created using draw.io)

In this article, we’ll see how to run unit tests and end-to-end (e2e) tests locally for an AWS Glue job that reads data from a Postgres RDS instance and writes it to an S3 bucket. The code can be found on GitHub.

Project Architecture

  1. The Glue job reads the config file stored in the mocked S3 bucket.
  2. Using the configuration details, database credentials are retrieved from the mocked Secrets Manager.
  3. Using these credentials, data is read from the Postgres instance (container) into a Spark data frame.
  4. The data frame is written to the mocked S3 bucket.

Secrets Manager and S3 need to be mocked because these services cannot be made available locally.

High-level process overview

1. From config.yaml, get configuration details like:
   - Secret name
   - List of tables which need to be processed
   - S3 bucket name
2. Retrieve the secret from Secrets Manager.
3. Loop over the list of tables:
   if active_flag is set to True for this table then
       read from the Postgres instance using spark.read(),
       with the schema which is stored in the S3 bucket
       if the read was successful
           write to S3 in a specific folder using spark.write()
       else
           skip processing for this table
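
To make this flow concrete, here is a minimal sketch of what such a job can look like. The helper structure, the config keys (secret_name, tables, bucket_name, active_flag) and the S3 paths are illustrative assumptions, not the exact code from the repository.

# Minimal sketch of the job flow described above. Helper names, config keys,
# and paths are illustrative assumptions, not the exact code from the repo.
import json

import boto3
import yaml
from pyspark.sql import SparkSession


def run_job(spark: SparkSession, config_path: str) -> None:
    # 1. Read configuration details (secret name, table list, bucket name) from config.yaml
    with open(config_path) as f:
        config = yaml.safe_load(f)

    # 2. Retrieve database credentials from Secrets Manager
    secret = boto3.client("secretsmanager").get_secret_value(SecretId=config["secret_name"])
    creds = json.loads(secret["SecretString"])

    # 3. Loop over the configured tables
    for table in config["tables"]:
        if not table.get("active_flag", False):
            continue  # skip inactive tables

        # Read the table from the Postgres instance over JDBC
        df = (
            spark.read.format("jdbc")
            .option("url", f"jdbc:postgresql://{creds['host']}:{creds['port']}/{creds['dbname']}")
            .option("dbtable", table["name"])
            .option("user", creds["username"])
            .option("password", creds["password"])
            .load()
        )

        # 4. Write the data frame to a table-specific folder in the S3 bucket
        df.write.mode("overwrite").csv(f"s3a://{config['bucket_name']}/{table['name']}/")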

Directory Structure (High Level)

glue:.
│ definitions.py -> Gives path of root directory
│ docker-compose.yaml -> Docker compose file
│ main.py -> Glue script
│ pg_hba.conf -> Postgres client authentication file
│ requirements.txt -> Required python libraries

├───dockerfiles -> Directory which contains Dockerfile for both containers
│ ├───awsglue
│ │
│ └───postgres

├───input -> Directory which has files which need to be present in S3 before running the job
│ ├───config -> Directory which contains config file
│ │
│ └───schema -> Directory which contains schema files for each table

├───tests -> Directory which contains all files related to pytest
│ │ conftest.py -> File which gets executed first when the pytest command is run
│ │
│ ├───e2e -> E2E test case directory
│ │
│ ├───output -> Directory into which csv files from the mock S3 location get downloaded
│ │
│ ├───sqlscripts -> Directory which contains the Postgres initialization script
│ │
│ └───utils -> Unit test case directory

└───utils -> Helper files keeping Single Responsibility principle in mind

Docker compose file

There are two containers, namely sd_glue_pytest and postgres. The compose file specifies the Dockerfile location for each container, the port mappings between the containers and the local machine, and the volumes to be mounted into the containers.
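
A trimmed, illustrative sketch of such a compose file follows; the service names match the article, but the ports, paths, and other details are assumptions and may differ from the actual file in the repository.

# Illustrative docker-compose sketch; ports, paths, and volume mounts are
# assumptions and may differ from the actual file in the repository.
version: "3.8"
services:
  sd_glue_pytest:
    build:
      context: .
      dockerfile: dockerfiles/awsglue/Dockerfile   # Glue + pytest container
    volumes:
      - .:/home/glue_user/workspace                # mount the local project into the container
    depends_on:
      - postgres
  postgres:
    build:
      context: .
      dockerfile: dockerfiles/postgres/Dockerfile  # local stand-in for the RDS instance
    ports:
      - "5432:5432"                                # expose Postgres to the local machine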

Requirements to run pytest locally

  1. S3 needs to be mocked -> can be done using moto
  2. Secrets Manager needs to be mocked -> can be done using moto
  3. We need a Postgres instance available locally (To act like an RDS instance) -> can be done using Postgres docker image
  4. Postgres instance should have all the required tables created before running the tests -> can be done using conftest.py
  5. spark_context, spark_session, glue_context objects need to be configured so that the mock S3 service is accessible to them
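
For requirement 5, the core idea is to point Spark’s S3A connector at the local moto endpoint instead of real AWS S3. A minimal sketch follows; the endpoint URL and dummy credentials are assumptions for illustration.

# Point Spark's S3A filesystem at a local moto endpoint instead of real S3.
# The endpoint URL and dummy credentials are illustrative assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glue-local-tests").getOrCreate()

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", "http://127.0.0.1:5000")   # moto server endpoint
hadoop_conf.set("fs.s3a.access.key", "testing")               # dummy credentials
hadoop_conf.set("fs.s3a.secret.key", "testing")
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "false")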

Setup Postgres and mock AWS services using conftest.py

conftest.py contains all the fixtures. Fixtures are functions wrapped by the @pytest.fixture decorator and are used to pass data or dependencies to tests. Fixtures have a parameter called scope, which defines how long a fixture instance persists. The default value for scope is function, so a new fixture object is created for each test function. Since we are using scope="session" in this instance, all fixtures are created only once and reused throughout the test run.

Every time we run our tests, conftest.py is the first file to be executed.

Fixtures created in our case

  1. setup_rdbms: Reads /tests/sqlscripts/postgres_ddl.sql and executes all SQL statements present in it. SQL statements include DROP, CREATE TABLE and INSERT queries.
  2. moto_server: Starts a moto server and yields it.
  3. s3_client: Creates a mock S3 client pointing to the mock S3 endpoint, uploads the config and schema files to our S3 bucket and yields the S3 client.
  4. spark_create: Creates spark_context, spark_session, glue_context objects and sets hadoop_conf, so that spark can access the mock S3 endpoint. Yields the spark objects.
  5. secret_client: Creates a mock secrets manager client and yields it.
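
As an illustration, the moto-related fixtures can look roughly like the following; the bucket and secret names, the port, and the file paths are assumptions based on the description above, not the exact conftest.py from the repository.

# Rough sketch of session-scoped fixtures along the lines described above.
# Bucket/secret names, the port, and file paths are illustrative assumptions.
import boto3
import pytest
from moto.server import ThreadedMotoServer

ENDPOINT_URL = "http://127.0.0.1:5000"


@pytest.fixture(scope="session")
def moto_server():
    # Start a standalone moto server that mocks AWS services over HTTP
    server = ThreadedMotoServer(port=5000)
    server.start()
    yield server
    server.stop()


@pytest.fixture(scope="session")
def s3_client(moto_server):
    # boto3 S3 client pointed at the moto endpoint; seed it with the config/schema files
    client = boto3.client(
        "s3",
        endpoint_url=ENDPOINT_URL,
        aws_access_key_id="testing",
        aws_secret_access_key="testing",
        region_name="us-east-1",
    )
    client.create_bucket(Bucket="test-bucket")
    client.upload_file("input/config/config.yaml", "test-bucket", "config/config.yaml")
    yield client


@pytest.fixture(scope="session")
def secret_client(moto_server):
    # Mock Secrets Manager client holding a dummy database secret
    client = boto3.client(
        "secretsmanager",
        endpoint_url=ENDPOINT_URL,
        aws_access_key_id="testing",
        aws_secret_access_key="testing",
        region_name="us-east-1",
    )
    client.create_secret(Name="postgres-secret", SecretString='{"username": "postgres", "password": "postgres"}')
    yield client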

E2E test case

e2e test function in test_main.py

In our Glue job’s main script, the Spark context and Glue context objects are created through createContexts(). When running tests locally, these objects need additional configuration, which is managed in conftest.py. So when running the e2e test case, we need to use the context objects created by conftest.py instead of the ones created by the main script.

To achieve this, we patch glue.main.DataTransfer.createContexts using the patch decorator from unittest.mock. Patching means replacing an object with a test double. We know beforehand what createContexts() should return, so we can patch it with the objects created by the spark_create fixture. Patching is useful for isolating our tests from external dependencies that cannot easily be recreated in the test environment.

After patching, we call the main() function. Once main() has completed execution, the data should have been loaded into the mock S3 endpoint, so we download the files from there to our local system (into the output/ directory) and verify them manually.
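
A simplified sketch of what this e2e test can look like is shown below; the fixture names follow the list above, while the import path, the shape of createContexts()’s return value, the bucket name, and the download logic are assumptions.

# Simplified sketch of the e2e test; the import path, return-value shape,
# bucket name, and download handling are illustrative assumptions.
from unittest.mock import patch

from glue import main as glue_main


def test_main_e2e(spark_create, s3_client, secret_client, setup_rdbms):
    spark_context, spark_session, glue_context = spark_create

    # Replace createContexts() with the locally configured Spark/Glue objects
    with patch(
        "glue.main.DataTransfer.createContexts",
        return_value=(spark_context, spark_session, glue_context),
    ):
        glue_main.main()

    # Download whatever the job wrote to the mock bucket for manual verification
    response = s3_client.list_objects_v2(Bucket="test-bucket")
    for obj in response.get("Contents", []):
        key = obj["Key"]
        s3_client.download_file("test-bucket", key, f"tests/output/{key.replace('/', '_')}")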

Similarly, we have written unit test cases for each function present in the utils/ folder.

How to run

  1. Clone the code using git clone https://github.com/sagadevanmi/Pytest-for-AWS-Glue.git
  2. cd to the directory where the code is cloned and then cd to glue/ folder
  3. Change the docker-compose.yaml file to mount your local directory into the container.
  4. Run the following commands in sequence
docker-compose up -d
docker cp .\pg_hba.conf postgres:/var/lib/postgresql/data/pg_hba.conf
docker restart postgres
(Screenshot: docker-compose up initial output)
If Access Denied is printed, try again; you’ll get a prompt to Allow/Reject.

Actions performed by these commands:

  1. Pull the images, set up the containers, and start them.
  2. Copy the pg_hba.conf file from our repo to the postgres container and restart the postgres container.

If the Dockerfile or docker-compose files are changed, run docker-compose build --no-cache to rebuild the images, followed by docker-compose up -d to start the new containers.

5. Attach VS Code to the running container (see the VS Code documentation on attaching to a running container).

6. Run pytest -o log_cli=TRUE --log-cli-level=INFO tests/. This will run all the unit tests and e2e tests present inside the tests/ folder.

If everything went fine, the command-line output will say 9 passed.

7. If you want to run only the e2e test case, run pytest -o log_cli=TRUE --log-cli-level=INFO tests/e2e/test_main.py

Enhancements which can be made

  1. Create audit tables and record the run status (success/failure), point of failure, and number of rows added for each table (so that we can resume from the point of failure)
  2. Measure test coverage
  3. Add linting using pylint
  4. Extend the process: move data from S3 to a warehouse like Redshift and build a dashboard using QuickSight

I hope you found this helpful. Thank you for reading!

References

Docker: https://docs.docker.com/compose/gettingstarted/

pytest: https://realpython.com/pytest-python-testing/

Moto: https://docs.getmoto.org/en/latest/
