Test AWS Glue jobs locally using Docker and moto
In this article, we'll walk through how to run unit tests and e2e tests locally for an AWS Glue job that reads data from a Postgres RDS instance and writes it to an S3 bucket. The code can be found on GitHub.
Project Architecture
- The Glue job reads the config file stored in the mocked S3 bucket.
- Using the configuration details, database credentials are retrieved from the mocked Secrets Manager.
- Using those credentials, data is read from the Postgres instance (a container) into a Spark data frame.
- The data frame is written to the mocked S3 bucket.
Secrets Manager and S3 need to be mocked because these services cannot be made available locally.
High-level process overview
1. From config.yaml, get configuration details such as:
   - Secret name
   - List of tables which need to be processed
   - S3 bucket name
2. Retrieve the secret from Secrets Manager.
3. Loop over the list of tables:
   - If active_flag is set to True for the table, read it from the Postgres instance using spark.read(), applying the schema stored in the S3 bucket; if the read was successful, write the data to a table-specific folder in S3 using spark.write().
   - Otherwise, skip processing for the table.
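To make this flow concrete, here is a minimal Python sketch of the driver loop. The helpers read_config and get_secret are hypothetical, and the JDBC option values are assumptions for illustration; the actual implementation in main.py may differ.

```python
# Illustrative sketch of the driver loop; read_config and get_secret are
# hypothetical helpers and the option values are assumptions, not the
# exact code from main.py.
def run_job(spark, s3_client, secrets_client, bucket):
    config = read_config(s3_client, bucket, "config/config.yaml")  # hypothetical helper
    creds = get_secret(secrets_client, config["secret_name"])      # hypothetical helper

    for table in config["tables"]:
        if not table.get("active_flag", False):
            continue  # skip inactive tables

        # Read the table from Postgres over JDBC (the schema files stored in
        # S3 would be applied here in the real job).
        df = (
            spark.read.format("jdbc")
            .option("url", f"jdbc:postgresql://{creds['host']}:{creds['port']}/{creds['dbname']}")
            .option("dbtable", table["name"])
            .option("user", creds["username"])
            .option("password", creds["password"])
            .option("driver", "org.postgresql.Driver")
            .load()
        )

        # Write the data frame to a table-specific prefix in the S3 bucket.
        df.write.mode("overwrite").csv(f"s3a://{bucket}/{table['name']}/")
```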
Directory Structure (High Level)
glue:.
│ definitions.py -> Gives path of root directory
│ docker-compose.yaml -> Docker compose file
│ main.py -> Glue script
│ pg_hba.conf -> Postgres client authentication file
│ requirements.txt -> Required python libraries
│
├───dockerfiles -> Directory which contains Dockerfile for both containers
│ ├───awsglue
│ │
│ └───postgres
│
├───input -> Directory which has files which need to be present in S3 before running the job
│ ├───config -> Directory which contains config file
│ │
│ └───schema -> Directory which contains schema files for each table
│
├───tests -> Directory which contains all files related to pytests
│ │ conftest.py -> File which gets executed first when pytest command is run
│ │
│ ├───e2e -> E2E test case directory
│ │
│ ├───output -> Directory in which csv files from mock S3 location get downloaded
│ │
│ ├───sqlscripts -> Directory which contains Postgres initialization script
│ │
│ └───utils -> Unit test cases directory
│
└───utils -> Helper files keeping Single Responsibility principle in mind
Docker compose file
There are two containers, namely sd_glue_pytest and postgres. In the compose file we specify the Dockerfile location for each container, the port mappings between the containers and the local machine, and the volumes to be mounted into the containers.
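For orientation, such a compose file might look like the sketch below; the build contexts, ports, and volume paths are assumptions for illustration, and the actual values live in the repository's docker-compose.yaml.

```yaml
# Illustrative sketch only; service details are assumptions, not the
# repository's actual docker-compose.yaml.
version: "3.8"
services:
  sd_glue_pytest:
    build:
      context: .
      dockerfile: dockerfiles/awsglue/Dockerfile
    ports:
      - "5000:5000"   # mock AWS endpoint served by moto
    volumes:
      - .:/home/glue_user/workspace   # mount the repo into the container
    depends_on:
      - postgres
  postgres:
    build:
      context: .
      dockerfile: dockerfiles/postgres/Dockerfile
    ports:
      - "5432:5432"   # Postgres exposed to the host
```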
Requirements to run Pytests locally
- S3 needs to be mocked -> can be done using moto.
- Secrets Manager needs to be mocked -> can be done using moto.
- A Postgres instance needs to be available locally (to act like an RDS instance) -> can be done using the Postgres Docker image.
- The Postgres instance should have all the required tables created before running the tests -> can be done using conftest.py.
- The spark_context, spark_session and glue_context objects need to be configured so that the mock S3 service is accessible to them.
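A minimal sketch of that last item, assuming the moto server is reachable at http://127.0.0.1:5000 and using dummy credentials (both assumptions for illustration):

```python
# Point Spark's S3A connector at the moto server instead of real S3.
# The endpoint URL and dummy credentials below are assumptions.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("glue-local-tests").getOrCreate()

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.endpoint", "http://127.0.0.1:5000")
hadoop_conf.set("fs.s3a.access.key", "testing")
hadoop_conf.set("fs.s3a.secret.key", "testing")
hadoop_conf.set("fs.s3a.path.style.access", "true")
hadoop_conf.set("fs.s3a.connection.ssl.enabled", "false")
```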
Setup Postgres and mock AWS services using conftest.py
conftest.py contains all the fixtures. Fixtures are functions wrapped by the pytest.fixture decorator and are used to pass data or dependencies to tests. Fixtures have a parameter called scope, which defines how long the fixture persists. The default scope is function, so a new fixture object is created for each test function. Since we are using scope=session here, each fixture is created only once and reused for the whole test run. Whenever we run our tests, conftest.py is the first file to be executed.
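For illustration, a session-scoped fixture looks like this (the name and return value are placeholders):

```python
import pytest

@pytest.fixture(scope="session")
def bucket_name():
    # Created once per test session and shared by every test that requests it;
    # with the default scope="function" it would be re-created for each test.
    return "sd-glue-pytest-bucket"  # placeholder value
```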
Fixtures created in our case
- setup_rdbms: Reads /tests/sqlscripts/postgres_ddl.sql and executes all SQL statements present in it. The SQL statements include DROP, CREATE TABLE and INSERT queries.
- moto_server: Starts a moto server and yields it.
- s3_client: Creates a mock S3 client pointing to the mock S3 endpoint, uploads the config and schema files to our S3 bucket and yields the S3 client.
- spark_create: Creates the spark_context, spark_session and glue_context objects and sets hadoop_conf so that Spark can access the mock S3 endpoint. Yields the Spark objects.
- secret_client: Creates a mock Secrets Manager client and yields it.
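Here is a sketch of the moto_server and s3_client fixtures, assuming moto's ThreadedMotoServer is available; the bucket name and uploaded key are illustrative, not the repository's exact values.

```python
# Sketch of two of the fixtures; bucket name and object key are illustrative.
import boto3
import pytest
from moto.server import ThreadedMotoServer

MOCK_ENDPOINT = "http://127.0.0.1:5000"


@pytest.fixture(scope="session")
def moto_server():
    server = ThreadedMotoServer(port=5000)  # serves mock AWS APIs over HTTP
    server.start()
    yield server
    server.stop()


@pytest.fixture(scope="session")
def s3_client(moto_server):
    client = boto3.client(
        "s3",
        endpoint_url=MOCK_ENDPOINT,
        aws_access_key_id="testing",
        aws_secret_access_key="testing",
        region_name="us-east-1",
    )
    client.create_bucket(Bucket="sd-glue-pytest-bucket")  # illustrative bucket name
    # Upload the config file so the Glue job can read it from mock S3.
    client.upload_file("input/config/config.yaml",
                       "sd-glue-pytest-bucket", "config/config.yaml")
    yield client
```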
E2E test case
In our Glue job's main script, the Spark context and Glue context objects are created through createContexts(). When running tests locally, these objects need additional configuration, which is handled in conftest.py. So when running the e2e test case, we need to use the context objects created by the conftest.py script instead of the ones created by the main script.
To achieve this, we patch glue.main.DataTransfer.createContexts using the unittest.mock.patch decorator. Patching means replacing an object with a test double. We know beforehand what createContexts() will return, so we can patch it to return the objects created by the spark_create fixture. Patching like this isolates our tests from external dependencies that cannot easily be mocked.
After patching, we call the main() function. Once main() has finished executing, the data should have been loaded to the mock S3 endpoint, so we download the files from there to our local system (into the output/ directory) and verify them manually.
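A sketch of this e2e test, using the fixture names from conftest.py; the exact signatures, the tuple yielded by spark_create, and the bucket name are assumptions for illustration.

```python
# Sketch of the e2e test; module and fixture names follow the article,
# but the exact signatures and the bucket name are assumptions.
from unittest import mock

from glue import main as glue_main


def test_e2e(spark_create, s3_client, secret_client, setup_rdbms):
    # Assumes spark_create yields the three context objects as a tuple.
    spark_context, spark_session, glue_context = spark_create

    # Replace createContexts() with a stub that returns the locally
    # configured context objects from the spark_create fixture.
    with mock.patch.object(
        glue_main.DataTransfer,
        "createContexts",
        return_value=(spark_context, spark_session, glue_context),
    ):
        glue_main.main()

    # The job should have written files to the mock S3 bucket; download them
    # to tests/output/ for manual verification.
    for obj in s3_client.list_objects_v2(Bucket="sd-glue-pytest-bucket").get("Contents", []):
        s3_client.download_file("sd-glue-pytest-bucket", obj["Key"],
                                f"tests/output/{obj['Key'].replace('/', '_')}")
```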
Similarly, we have written unit test cases for each function present in the utils/
folder.
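The shape of such a unit test might look like the following; read_config and its module path are hypothetical placeholders, not the actual helpers in utils/.

```python
# Hypothetical example only: utils.config_reader.read_config is a placeholder
# name, not an actual helper from the repository.
from utils.config_reader import read_config  # hypothetical module/function


def test_read_config(s3_client):
    # The s3_client fixture has already uploaded config/config.yaml to the mock bucket.
    config = read_config(s3_client, "sd-glue-pytest-bucket", "config/config.yaml")
    assert "secret_name" in config
    assert isinstance(config["tables"], list)
```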
How to run
1. Clone the code: git clone https://github.com/sagadevanmi/Pytest-for-AWS-Glue.git
2. cd to the directory where the code was cloned, then cd into the glue/ folder.
3. Change the docker-compose.yaml file to mount your local directory into the container, then run the following commands in sequence:
docker-compose up -d
docker cp .\pg_hba.conf postgres:/var/lib/postgresql/data/pg_hba.conf
docker restart postgres
These commands pull the images, set up the containers and start them, then copy the pg_hba.conf file from our repo into the postgres container and restart it.
If the Dockerfile or docker-compose files are changed, run docker-compose build --no-cache to build new images, followed by docker-compose up -d to run the new containers.
4. Attach VS Code to the running container. Instructions can be found here.
5. Run pytest -o log_cli=TRUE --log-cli-level=INFO tests/. This will run all the unit tests and e2e tests present inside the tests/ folder. If everything went fine, we'll get an output on the command line saying 9 passed.
6. To run only the e2e test case, run pytest -o log_cli=TRUE --log-cli-level=INFO tests/e2e/test_main.py
Enhancements which can be made
- Create audit tables and record run status (success/failure), point of failure and number of rows added for each table, so that we can resume from the point of failure.
- Measure test coverage.
- Add linting using pylint.
- Extend the process: move data from S3 to a warehouse like Redshift and build a dashboard using QuickSight.
I hope you found this helpful. Thank you for reading!
References
Docker: https://docs.docker.com/compose/gettingstarted/