Running AWS Glue jobs with external libraries in a private VPC can fail, because the Glue job needs to download dependency libraries from the internet while the private VPC blocks internet access. In this blog, we walk through the debugging process in detail and outline a potential solution to this problem.
In this scenario, an Aurora RDS instance is running inside a private VPC for security purposes. A Pythonshell Glue job needs to read data from an S3 bucket and insert the processed data into this RDS instance. As the data in S3 is in Parquet format, the pyarrow library is required to read it. We downloaded pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl and attached it to the job as an external library.
Compared to a PySpark Glue job, a Pythonshell job is lightweight and cheap to run, which makes it suitable for data sizes that are not too large. However, unlike the PySpark job, it does not have built-in support for the Parquet data format, which is why the external pyarrow library is required.
After we attached the pyarrow library to the Glue job and ran it, the job failed to start. What's wrong with this setup?
In the CloudWatch logs for the Glue job, only one line of trace output is found:
Processing ./glue-python-libs-vfw2id1k/pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl
The error logs, however, contain the following lines:
```
WARNING: The directory '/.cache/pip' or its parent directory is not owned or is not writable by the current user. The cache has been disabled. Check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f3228a97518>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f3228a975c0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f3228a972b0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f3228a976a0>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.VerifiedHTTPSConnection object at 0x7f3228a97160>, 'Connection to pypi.org timed out. (connect timeout=15)')': /simple/six/
ERROR: Could not find a version that satisfies the requirement six>=1.0.0 (from pyarrow==0.15.1) (from versions: none)
ERROR: No matching distribution found for six>=1.0.0 (from pyarrow==0.15.1)
Traceback (most recent call last):
  File "/tmp/runscript.py", line 117, in <module>
    download_and_install(args.extra_py_files)
  File "/tmp/runscript.py", line 63, in download_and_install
    subprocess.check_call([sys.executable, "-m", "pip", "install", "--target={}".format(install_path), local_file_path])
  File "/usr/local/lib/python3.6/subprocess.py", line 311, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['/usr/local/bin/python', '-m', 'pip', 'install', '--target=/glue/lib/installation', '/tmp/glue-python-libs-vfw2id1k/pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl']' returned non-zero exit status 1.
```
The repeated timeouts connecting to pypi.org suggest that the Glue job has no internet access because it runs in a private VPC. However, we need more evidence to confirm this.
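One quick way to gather such evidence from inside a job is a plain TCP probe. The sketch below is only an illustration and not part of the original debugging session: the helper name `can_reach` and the default port and timeout are our own assumptions.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and DNS resolution failures
        return False
```

Calling `can_reach("pypi.org")` at the top of a job script and printing the result would show in the job logs whether the package index is reachable from the subnet the job runs in.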
To verify that the Glue job fails because internet access is blocked, we can temporarily give the private subnet internet access. To do so, we add an entry to the route table of the private subnet that routes all internet traffic to a NAT gateway in a public subnet.
With internet access in place, the Glue job runs again and finishes successfully. The log is:
```
Processing ./glue-python-libs-a5wh8mr_/pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl
Collecting numpy>=1.14
  Downloading numpy-1.19.0-cp36-cp36m-manylinux2010_x86_64.whl (14.6 MB)
Collecting six>=1.0.0
  Downloading six-1.15.0-py2.py3-none-any.whl (10 kB)
Installing collected packages: numpy, six, pyarrow
Successfully installed numpy-1.19.0 pyarrow-0.15.1 six-1.15.0
```
From the logs, we can see that the Glue job needs to download the dependencies six and numpy from the internet. This is a serious problem in production, as internet access is not available in the private VPC.
A number of Python modules are supported by a Glue job out of the box. Running the following code snippet inside a Glue job prints the list of supported Python modules:
```python
import pkg_resources

# Print every distribution installed in the job's Python environment
for x in pkg_resources.working_set:
    print(x)
```
The output is:
```
zipp 3.1.0
wheel 0.34.2
virtualenv 20.0.20
urllib3 1.25.9
six 1.14.0
setuptools 46.1.3
scipy 1.2.1
scikit-learn 0.20.3
s3transfer 0.2.1
rsa 3.4.2
requests 2.22.0
PyYAML 5.2
pytz 2020.1
python-dateutil 2.8.1
PyGreSQL 5.0.6
pyasn1 0.4.8
pip 20.1
pandas 0.24.2
numpy 1.16.2
jmespath 0.9.5
importlib-resources 1.5.0
importlib-metadata 1.6.0
idna 2.8
filelock 3.0.12
docutils 0.15.2
distlib 0.3.0
colorama 0.3.9
chardet 3.0.4
certifi 2020.4.5.1
botocore 1.12.232
boto3 1.9.203
awscli 1.16.242
appdirs 1.4.3
```
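Among them, we can see that six and numpy are included. So why can't the pyarrow module use the natively supported modules as dependencies? A likely reason is visible in the traceback above: the job installs the whl file with pip's --target option, which makes pip ignore packages that are already installed, so it tries to fetch every declared dependency again.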
We found a potential solution to this issue. The dependencies of a Python module in whl format are declared in a metadata file inside the package and can be removed by editing this file.
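To see which dependencies a whl file declares without unpacking it by hand, a short script can read the metadata file directly. This is a minimal sketch using only the standard library; the function name `wheel_requirements` is our own, and it assumes the whl file contains a single `.dist-info/METADATA` member.

```python
import zipfile
from email.parser import Parser


def wheel_requirements(wheel_path):
    """Return the Requires-Dist entries declared in a wheel's METADATA file."""
    with zipfile.ZipFile(wheel_path) as whl:
        # Assumes one <name>-<version>.dist-info/METADATA member per wheel
        name = next(n for n in whl.namelist()
                    if n.endswith(".dist-info/METADATA"))
        text = whl.read(name).decode("utf-8")
    # METADATA uses RFC 822 style headers, so the email parser can read it
    return Parser().parsestr(text).get_all("Requires-Dist") or []
```

Comparing the returned entries with the modules a Glue job already provides shows which dependencies pip would try to download.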
The whl module is packed in zip format. Unzipping pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl creates two folders: pyarrow and pyarrow-0.15.1.dist-info. Edit the file pyarrow-0.15.1.dist-info/METADATA, remove the following lines, and save it:
```
Requires-Dist: numpy (>=1.14)
Requires-Dist: six (>=1.0.0)
Requires-Dist: futures; python_version < "3.2"
Requires-Dist: enum34 (>=1.1.6); python_version < "3.4"
```
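The same edit can be scripted instead of done by hand. The sketch below is one possible approach and not the exact procedure used above: the function name `strip_requirements` is hypothetical, and it removes every Requires-Dist header, which is only safe when all of the dependencies are already available in the job.

```python
def strip_requirements(metadata_text):
    """Return METADATA text with every Requires-Dist header line removed."""
    kept = []
    in_headers = True
    for line in metadata_text.splitlines(keepends=True):
        if in_headers and line.strip() == "":
            # The first blank line ends the headers; the rest is the description
            in_headers = False
        if in_headers and line.startswith("Requires-Dist:"):
            continue
        kept.append(line)
    return "".join(kept)
```

Only header lines are dropped, so a package description that happens to mention Requires-Dist is left untouched.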
The file pyarrow-0.15.1.dist-info/RECORD contains the sha256 digest and size of every other file inside this module. As the METADATA file was modified, its sha256 digest and size must be updated as well. Run the following code to calculate them:
```python
import hashlib
from base64 import urlsafe_b64encode

# read_chunks is a pip-internal helper, so its location may change between pip versions
from pip._internal.utils.misc import read_chunks


def rehash(path, blocksize=1 << 20):
    # type: (str, int) -> Tuple[str, str]
    """Return (hash, length) for path using hashlib.sha256()"""
    h = hashlib.sha256()
    length = 0
    with open(path, 'rb') as f:
        for block in read_chunks(f, size=blocksize):
            length += len(block)
            h.update(block)
    digest = 'sha256=' + urlsafe_b64encode(
        h.digest()
    ).decode('latin1').rstrip('=')
    # unicode/str python2 issues
    return (digest, str(length))


# Print the new digest and size of the edited METADATA file
print(rehash('pyarrow-0.15.1.dist-info/METADATA'))
```
Edit the pyarrow-0.15.1.dist-info/RECORD file and find the line that looks like the following:
pyarrow-0.15.1.dist-info/METADATA,sha256=xxxx
Replace the sha256 digest in this line, and the file size that follows it, with the output from the code snippet above. Save the file, zip the pyarrow and pyarrow-0.15.1.dist-info folders into one file, and rename the zip file to the original name, pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl. Run the Glue job with the modified whl file in a private VPC with no internet access. The Glue job now runs successfully:
```
Processing ./glue-python-libs-tbnhyh5i/pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl
Installing collected packages: pyarrow
Successfully installed pyarrow-0.15.1
```
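The RECORD update and the final zip step can also be automated once METADATA has been edited. The following is a rough sketch under our own assumptions, not the procedure used above: the names `record_hash` and `repack_wheel` are hypothetical, the unpacked folder is expected to contain only the package and dist-info folders, and the output path must lie outside that folder so the archive does not include itself.

```python
import base64
import hashlib
import os
import zipfile


def record_hash(path):
    """Return (digest, size) for a file in the format used by RECORD."""
    with open(path, "rb") as f:
        data = f.read()
    raw = hashlib.sha256(data).digest()
    digest = "sha256=" + base64.urlsafe_b64encode(raw).decode("ascii").rstrip("=")
    return digest, str(len(data))


def repack_wheel(unpacked_dir, dist_info, wheel_path):
    """Refresh the METADATA row in RECORD, then zip unpacked_dir into wheel_path."""
    metadata_rel = dist_info + "/METADATA"
    record_path = os.path.join(unpacked_dir, dist_info, "RECORD")
    digest, size = record_hash(os.path.join(unpacked_dir, dist_info, "METADATA"))

    with open(record_path, encoding="utf-8", newline="") as f:
        lines = f.read().splitlines()
    # Each RECORD row is "path,hash,size"; rewrite only the METADATA row
    lines = [
        ",".join([metadata_rel, digest, size])
        if line.split(",")[0] == metadata_rel else line
        for line in lines
    ]
    with open(record_path, "w", encoding="utf-8", newline="") as f:
        f.write("\n".join(lines) + "\n")

    with zipfile.ZipFile(wheel_path, "w", zipfile.ZIP_DEFLATED) as whl:
        for root, _dirs, files in os.walk(unpacked_dir):
            for name in sorted(files):
                full = os.path.join(root, name)
                # Archive names are relative to the unpacked folder, with "/" separators
                arcname = os.path.relpath(full, unpacked_dir).replace(os.sep, "/")
                whl.write(full, arcname)
```

For the example in this blog, the call would look like `repack_wheel("unpacked", "pyarrow-0.15.1.dist-info", "pyarrow-0.15.1-cp36-cp36m-manylinux2010_x86_64.whl")`, where `unpacked` is an assumed name for the folder the original whl file was unzipped into.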
Authored by Edward Liu, Darrell Chua, Tim Elson, and Andrew Perulero.