Comparing AWS Glue Python Shell vs AWS Batch
An opinionated comparison
The AWS Glue Python Shell job type runs general-purpose, small-to-medium-sized tasks written in Python on AWS Glue. Scripts executed in a Python shell job do not have a Spark context, because the execution environment is just a Python shell. More on Glue Python Shell here.
But the environment comes with a lot of restrictions and may not suit every use case. The following sections look at its limitations and at a possible alternative, AWS Batch.
Please note that the article assumes that the job has to be written in Python only. If that is not the case, then Glue Python shell is already not the right choice for you. 😁
Execution environment
Glue Python Shell
With a Python shell job, you can run scripts that are compatible with Python 2.7 or Python 3.6.
The Python versions are limited to 2.7 and 3.6. If the task you are planning to execute depends on a different version of Python, then a Python Shell job is not a suitable option.
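A script can check the interpreter before doing any real work. The following is a minimal sketch; `SUPPORTED` and `is_supported` are names chosen for illustration and are not part of Glue.

```python
import sys

# Interpreter versions this article lists for Glue Python Shell.
SUPPORTED = {(2, 7), (3, 6)}


def is_supported(version_info=sys.version_info):
    """Return True if the major.minor version is one Glue Python Shell offers."""
    return (version_info[0], version_info[1]) in SUPPORTED


if not is_supported():
    # Report the mismatch early instead of failing halfway through the task.
    print("Python %d.%d is not offered by Glue Python Shell" % tuple(sys.version_info[:2]))
```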
The environment also comes with a finite set of preloaded libraries, which can be used in the Python script file. If the task requires additional or custom libraries to be installed in the Python environment, they can be packaged with setuptools. However, the transitive dependencies of the preloaded libraries can clash with the transitive dependencies of the library you want to use. In such cases, writing the task and managing its dependencies can become quite tricky.
The underlying environment is not very customisable, as Glue Python Shell provides a serverless environment similar to Lambda. This can cause problems such as this one on Stack Overflow, which requires some customisation of the environment.
AWS Batch
AWS Batch runs on EC2 or Fargate instances created from a Docker image. The Docker image can be custom built to meet the requirements of the use case at hand. The necessary dependencies can be easily managed using a dependency management tool like Poetry.
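To make the version-mismatch problem concrete, here is a small sketch that compares exact version pins. The package versions are hypothetical and `find_conflicts` is an illustrative helper; real resolvers work with version ranges rather than exact pins.

```python
def find_conflicts(preloaded, required):
    """Map each clashing package to (preloaded_version, required_version)."""
    return {
        name: (preloaded[name], wanted)
        for name, wanted in required.items()
        if name in preloaded and preloaded[name] != wanted
    }


# Hypothetical versions, for illustration only.
preloaded = {"numpy": "1.16.2", "requests": "2.22.0"}
required_by_new_library = {"numpy": "1.19.5", "pyyaml": "5.4.1"}

conflicts = find_conflicts(preloaded, required_by_new_library)
print(conflicts)  # {'numpy': ('1.16.2', '1.19.5')}
```

Packages that appear on only one side (`requests`, `pyyaml`) are not conflicts; only `numpy` is pinned differently by both.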
Ease of deployment
Glue Python Shell
The Python script file to be executed as a job must be stored in an S3 bucket. The execution environment picks up the file based on the S3 URL provided as a configuration parameter during job creation. The deployment pipeline therefore has to update the file in S3.
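The upload step of such a pipeline could look roughly like the sketch below. It assumes a boto3 S3 client is passed in; `deploy_script`, the bucket name, and the key are hypothetical.

```python
def deploy_script(s3_client, local_path, bucket, key):
    """Upload the job script and return the S3 URL the Glue job should point at."""
    s3_client.upload_file(local_path, bucket, key)
    return "s3://{}/{}".format(bucket, key)


# Usage in a pipeline step (needs boto3 and AWS credentials):
#   import boto3
#   url = deploy_script(boto3.client("s3"), "job.py", "my-glue-scripts", "jobs/job.py")
```

Passing the client in keeps the function easy to exercise with a stub before it touches a real bucket.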
If the code for the job is too large to maintain in a single file, it can be written as multiple Python modules. But as the job configuration accepts only a single script file, the full Python package has to be bundled as a whl or an egg file. The bundled package can then be added as a custom library in the job configuration. The Glue Python Shell environment ensures these packages are installed before executing the script file.
Though the process described above can be achieved using CI/CD, it is not clean.
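As a sketch of how the script and the bundled library come together, the helper below builds the keyword arguments for boto3's `glue.create_job`. The helper name and all values passed to it are hypothetical; the `--extra-py-files` argument is where the whl or egg locations go.

```python
def python_shell_job_config(name, role_arn, script_url, library_urls, max_capacity=0.0625):
    """Build keyword arguments for a Python shell job in boto3's glue.create_job."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",         # job type: Python shell rather than Spark
            "ScriptLocation": script_url,  # the single script file in S3
            "PythonVersion": "3",
        },
        # Comma-separated S3 URLs of the bundled whl/egg files.
        "DefaultArguments": {"--extra-py-files": ",".join(library_urls)},
        "MaxCapacity": max_capacity,       # 0.0625 or 1 DPU
    }


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   boto3.client("glue").create_job(**python_shell_job_config(...))
```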
AWS Batch
As already stated in the previous section, the Batch environment is created from a Docker image and is therefore highly customisable. The deployment pipeline can build the Docker image and push it to a repository on ECR. The image can then be referenced in a job definition.
Batch requires three artefacts in total to execute a job:
A compute environment, which describes the compute that is available: whether the underlying infrastructure is based on EC2 or Fargate, and the total vCPUs and memory available for jobs to execute.
A job queue, which holds the jobs to be executed at any given point in time. The compute environment to be used by the job queue is also configured here.
A job definition, which contains the Docker image details, entry point information and other configuration such as environment variables, volumes etc.
These infrastructure artefacts can be maintained as code using either CloudFormation or Terraform.
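For readers who prefer to see the shape of these artefacts in Python, the sketch below builds the arguments for boto3's Batch `register_job_definition` and `submit_job` calls. It assumes an EC2-backed compute environment and an existing job queue; the helper names and all values are hypothetical.

```python
def job_definition(name, image, command, vcpus, memory_mib, env=None):
    """Build keyword arguments for boto3's batch.register_job_definition."""
    return {
        "jobDefinitionName": name,
        "type": "container",
        "containerProperties": {
            "image": image,      # e.g. an image pushed to ECR by the pipeline
            "command": command,  # command to run inside the container
            "environment": [
                {"name": key, "value": value}
                for key, value in sorted((env or {}).items())
            ],
            "resourceRequirements": [
                {"type": "VCPU", "value": str(vcpus)},
                {"type": "MEMORY", "value": str(memory_mib)},  # in MiB
            ],
        },
    }


def job_submission(job_name, queue, definition_name):
    """Build keyword arguments for boto3's batch.submit_job."""
    return {"jobName": job_name, "jobQueue": queue, "jobDefinition": definition_name}


# Usage (needs boto3 and AWS credentials):
#   import boto3
#   batch = boto3.client("batch")
#   batch.register_job_definition(**job_definition(...))
#   batch.submit_job(**job_submission(...))
```

A Fargate-backed setup needs additional fields on the job definition, which this sketch leaves out.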
Pricing
Glue Python Shell
An AWS Glue job of type Python shell can be allocated either 1 DPU or 0.0625 DPU. By default, AWS Glue allocates 0.0625 DPU to each Python shell job. You are billed $0.44 per DPU-Hour in increments of 1 second, rounded up to the nearest second, with a 1-minute minimum duration for each job of type Python shell.
If we assume our job requires 1 DPU, which is 4 vCPUs and 16 GB of memory, a daily job that takes 1 hour to complete would cost approximately $13.20 per month (1 DPU × 1 hour × $0.44 × 30 days).
S3 storage costs $0.023 per GB per month, but the Python script file and its libraries will be much smaller than a gigabyte.
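The arithmetic behind this estimate can be sketched as follows, using the billing rules quoted above (per-second billing, 1-minute minimum). The function names are illustrative, and a 30-day month is assumed.

```python
import math

GLUE_RATE_PER_DPU_HOUR = 0.44  # USD, the rate quoted above


def glue_run_cost(dpus, runtime_seconds, rate=GLUE_RATE_PER_DPU_HOUR):
    """Cost of one run: rounded up to the next second, with a 60-second minimum."""
    billed_seconds = max(60, math.ceil(runtime_seconds))
    return dpus * (billed_seconds / 3600.0) * rate


def glue_monthly_cost(dpus, runtime_seconds, runs_per_month=30):
    """Cost of running the same job once a day for an assumed 30-day month."""
    return runs_per_month * glue_run_cost(dpus, runtime_seconds)


print(round(glue_monthly_cost(1, 3600), 2))  # 13.2
```

With the default allocation of 0.0625 DPU, the same daily one-hour job comes to roughly $0.83 per month.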
AWS Batch
There is no additional charge for AWS Batch. You pay for AWS resources (e.g. EC2 instances, AWS Lambda functions or AWS Fargate) you create to store and run your application.
A similar batch job as defined in the previous section would cost approximately $6 per month using on-demand EC2 instances with 10 GB EBS storage.
In addition, Batch offers the option to run jobs on Spot Instances. If the nature of the job allows Spot Instances to be used, the cost can be much lower.
The Docker image storage on ECR would cost $0.10 per GB per month.
Given the comparison above, unless there is a specific requirement to run a job on Python Shell, Batch is the better choice over Python Shell.