Unsticking Glue

How to use AWS Glue with external libraries in a private VPC

Introduction

AWS Glue jobs that use external libraries can fail when they run in a private VPC. The cause is that the Glue job tries to download the libraries' dependencies from the internet, while the private VPC blocks internet access. In this blog, we walk through the debugging process in detail and present a potential solution to this problem.

The issue of running Glue inside a private VPC

In this scenario, an Aurora RDS instance runs inside a private VPC for security purposes. A Pythonshell Glue job needs to read data from an S3 bucket and insert the processed data into this RDS instance. As the data in S3 is in Parquet format, the pyarrow library is required to read it. We downloaded pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl and attached it to the job as an external library.

Compared to a PySpark Glue job, a Pythonshell job is lightweight and cheap to run, which makes it suitable for data sizes that are not too large. However, unlike the PySpark job, it does not have built-in support for the Parquet data format, which is why the external pyarrow library is required.

However, after attaching the pyarrow library to the Glue job and running it, the Glue job failed to start. What's wrong with this setup?

The whl file needs to download its dependencies

In the CloudWatch logs for the Glue job, only one line of trace output is found:

Processing ./glue-python-libs-vfw2id1k/pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl

In the error logs, the following lines can be found:

WARNING: The directory '/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f3228a97518>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f3228a975c0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f3228a972b0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f3228a976a0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f3228a97160>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
ERROR: Could not find a version that satisfies the requirement six>=1.0.0 (from pyarrow==0.15.1) (from versions: none)
ERROR: No matching distribution found for six>=1.0.0 (from pyarrow==0.15.1)
Traceback (most recent call last):
  File "/tmp/runscript.py", line 117, in <module>
    download_and_install(args.extra_py_files)
  File "/tmp/runscript.py", line 63, in download_and_install
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--target={}".format(install_path), local_file_path])
  File "/usr/local/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', '-m', 'pip', 'install', '--target=/glue/lib/installation', '/tmp/glue-python-libs-vfw2id1k/pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl']' returned non-zero exit status 1.

It looks like the Glue job cannot reach the internet because it runs in a private VPC. However, we need more evidence to confirm this.
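One quick way to gather such evidence from inside the job environment is a plain TCP probe against pypi.org, the host named in the warnings above. The following is a minimal sketch, not part of the original investigation; the function name can_reach and the default timeout are our own choices for illustration.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and DNS failures.
        return False
```

Calling can_reach("pypi.org") from a job script would be expected to return False in the blocked subnet. Judging by the traceback, the failure above happens while the library is being installed, presumably before the job script itself starts, so such a probe would have to run in a job that does not have the pyarrow whl attached.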

Verifying that Glue needs internet access to download dependencies

To verify that the Glue job fails because internet access is blocked, we can temporarily give the private subnet internet access. To do so, we add an entry to the route table of the private subnet that routes all internet-bound traffic to a NAT gateway in a public subnet.

[Figure: route table of the private subnet]
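When this setup is inspected programmatically, the check amounts to looking for a default route (0.0.0.0/0) that targets a NAT gateway. The sketch below works on route entries shaped like those in an EC2 DescribeRouteTables response, using its DestinationCidrBlock and NatGatewayId keys. The function name and the sample IDs are hypothetical and only serve as illustration.

```python
def has_nat_default_route(routes):
    """Return True if any route sends 0.0.0.0/0 traffic to a NAT gateway."""
    for route in routes:
        is_default = route.get("DestinationCidrBlock") == "0.0.0.0/0"
        if is_default and route.get("NatGatewayId"):
            return True
    return False


# Hypothetical sample entries, for illustration only.
private_routes = [
    {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local"},
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-0123456789abcdef0"},
]
print(has_nat_default_route(private_routes))
```

The same function can confirm that the temporary route has been removed once the experiment is over.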

When we run the Glue job again, it finishes successfully. The log is:

Processing ./glue-python-libs-a5wh8mr_/pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl
Collecting numpy>=1.14
  Downloading numpy-1.19.0-cp36-cp36m-manylinux2010_x86_64.whl (14.6 MB)
Collecting six>=1.0.0
  Downloading six-1.15.0-py2.py3-none-any.whl (10 kB)
Installing collected packages: numpy, six, pyarrow
Successfully installed numpy-1.19.0 pyarrow-0.15.1 six-1.15.0

From the logs, we can see that the Glue job downloads the dependencies six and numpy from the internet. This is a serious problem in production, where internet access is not available in the private VPC.

Does it really need to download dependencies?

A number of Python modules are available in a Glue job out of the box. Running the following code snippet inside a Glue job lists them:

python
import pkg_resources

for x in pkg_resources.working_set:
    print(x)

The output is:

zipp 3.1.0
wheel 0.34.2
virtualenv 20.0.20
urllib3 1.25.9
six 1.14.0
setuptools 46.1.3
scipy 1.2.1
scikit-learn 0.20.3
s3transfer 0.2.1
rsa 3.4.2
requests 2.22.0
PyYAML 5.2
pytz 2020.1
python-dateutil 2.8.1
PyGreSQL 5.0.6
pyasn1 0.4.8
pip 20.1
pandas 0.24.2
numpy 1.16.2
jmespath 0.9.5
importlib-resources 1.5.0
importlib-metadata 1.6.0
idna 2.8
filelock 3.0.12
docutils 0.15.2
distlib 0.3.0
colorama 0.3.9
chardet 3.0.4
certifi 2020.4.5.1
botocore 1.12.232
boto3 1.9.203
awscli 1.16.242
appdirs 1.4.3

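Both six and numpy are in the list, and the preinstalled versions (six 1.14.0 and numpy 1.16.2) satisfy pyarrow's requirements (six>=1.0.0 and numpy>=1.14). So why can't the pyarrow module use the natively supported modules as dependencies? A likely explanation is the --target option in the pip command shown in the traceback: when installing into a target directory, pip does not count packages already installed elsewhere as satisfying the requirements, so it tries to download them again.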
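To make the comparison concrete, the sketch below checks the preinstalled versions from the list above against the minimum versions that pyarrow 0.15.1 asks for in the logs. It only handles plain dotted numeric versions, an assumption that holds for these two packages, and the helper names are made up for illustration.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '1.16.2' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    """Return True if installed >= minimum, comparing component by component."""
    a, b = parse_version(installed), parse_version(minimum)
    # Pad the shorter tuple with zeros so '1.14' compares as '1.14.0'.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# Preinstalled versions from the module list vs. the minimums pyarrow requests.
print(meets_minimum("1.16.2", "1.14"))   # numpy
print(meets_minimum("1.14.0", "1.0.0"))  # six
```

Comparing integer tuples rather than strings matters here: as strings, "1.9" would sort after "1.14".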

Editing the whl file to resolve the issue

We found a potential solution to this issue. The dependencies of a Python package in whl format are declared in a metadata file inside the package, and they can be removed by editing this file. A whl file is a zip archive. Unzipping pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl creates two folders: pyarrow and pyarrow-0.15.1.dist-info. Edit the file pyarrow-0.15.1.dist-info/METADATA, remove the following lines, and save it:

Requires-Dist: numpy (>=1.14)
Requires-Dist: six (>=1.0.0)
Requires-Dist: futures; python_version < "3.2"
Requires-Dist: enum34 (>=1.1.6); python_version < "3.4"
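The manual edit can also be scripted. The following sketch removes every line that starts with Requires-Dist: from the METADATA text. It assumes that the four lines shown above are the only such entries in this wheel and that none of them is wrapped across several lines. The function name is our own.

```python
def strip_requires_dist(metadata_text):
    """Return METADATA text with every Requires-Dist header line removed."""
    kept = [
        line
        for line in metadata_text.splitlines(keepends=True)
        if not line.startswith("Requires-Dist:")
    ]
    return "".join(kept)
```

A typical use would be to read pyarrow-0.15.1.dist-info/METADATA, pass its contents through strip_requires_dist, and write the result back to the same file.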

The file pyarrow-0.15.1.dist-info/RECORD contains the sha256 digest and size of every other file in the package. As the METADATA file was modified, its sha256 digest and size must be updated as well. Run the following code to calculate them:

python
import hashlib
from base64 import urlsafe_b64encode


def rehash(path, blocksize=1 << 20):
    """Return (digest, length) for path in the format used by RECORD."""
    h = hashlib.sha256()
    length = 0
    with open(path, 'rb') as f:
        # Read in fixed-size blocks so large files need not fit in memory.
        # This replaces pip's internal read_chunks helper with the standard library.
        for block in iter(lambda: f.read(blocksize), b''):
            length += len(block)
            h.update(block)
    # RECORD stores the digest as URL-safe base64 with the '=' padding removed.
    digest = 'sha256=' + urlsafe_b64encode(h.digest()).decode('latin1').rstrip('=')
    return (digest, str(length))


# Run from the folder where the whl file was unzipped.
print(rehash('pyarrow-0.15.1.dist-info/METADATA'))

Edit the pyarrow-0.15.1.dist-info/RECORD file and find the line for METADATA, which looks like the following:

pyarrow-0.15.1.dist-info/METADATA,sha256=xxxx
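RECORD is a CSV file whose rows hold a path, a hash, and a size, so the update can be done with the csv module instead of by hand. The sketch below rewrites the row for one path and leaves all other rows as they are. The function name is hypothetical, and the digest and size arguments are expected to be the two strings produced by the hashing snippet above.

```python
import csv
import io


def update_record_entry(record_text, path, digest, size):
    """Return RECORD text with the digest and size for `path` replaced."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for row in csv.reader(io.StringIO(record_text)):
        if row and row[0] == path:
            # Replace the whole row: path, new digest, new size.
            row = [path, digest, size]
        writer.writerow(row)
    return out.getvalue()
```

Note that this sketch writes "\n" line endings regardless of what the input used.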

Replace the sha256 digest and the size in this line with the output of the code snippet above. Save the file, zip the pyarrow and pyarrow-0.15.1.dist-info folders into one file so that both folders sit at the top level of the archive, and rename the zip file to the original name, pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl. Run the Glue job with the modified whl file in a private VPC with no internet access. The Glue job now runs successfully:

Processing ./glue-python-libs-tbnhyh5i/pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl
Installing collected packages: pyarrow
Successfully installed pyarrow-0.15.1
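The repacking step can be scripted as well, along with a check that the resulting whl file no longer declares dependencies. The following is a sketch using only the standard library. The function names are our own, and it assumes that the unzipped folders live in a directory of their own and that the output file is written outside that directory.

```python
import os
import zipfile


def zip_wheel(src_dir, whl_path):
    """Zip everything under src_dir into whl_path, with paths relative to src_dir."""
    with zipfile.ZipFile(whl_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                full = os.path.join(root, name)
                # Archive names must start at the package root, e.g. 'pyarrow/...'.
                arcname = os.path.relpath(full, src_dir).replace(os.sep, "/")
                zf.write(full, arcname)


def declared_requirements(whl_path):
    """Return the Requires-Dist lines found in the wheel's METADATA file."""
    with zipfile.ZipFile(whl_path) as zf:
        meta_name = next(
            n for n in zf.namelist() if n.endswith(".dist-info/METADATA")
        )
        text = zf.read(meta_name).decode("utf-8")
    return [line for line in text.splitlines() if line.startswith("Requires-Dist:")]
```

After zip_wheel has produced the file, declared_requirements should return an empty list for the edited wheel. A non-empty list means some Requires-Dist lines are still present and pip would try to resolve them.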

Authored by Edward Liu, Darrell Chua, Tim Elson, and Andrew Perulero.