CDK - Like a fish needs a bicycle

AWS's half-baked deployment framework

Like a goat needs an accordion

Preamble and the TLDR;

The road to a successful and easily reproducible AWS infrastructure deployment is riddled with pitfalls and you don’t want to introduce more of them by using an abstraction layer that brings even more pain and tears.

CDK brought with it the promise of developing infrastructure code in a "real" programming language of one's choice, deploying it without having to step out of the development framework, and previewing the changes prior to their actual deployment - yet it hardly delivered even one of these to its logical completion.

This kind of preamble might sound like yet another whining blog post from a grouchy DevOps person who perhaps "doesn't want to learn", but hear me out: I did try to learn, and I learned a lot, and I wish to share exactly what I learned during our team's honest plunge into the world of CDK so that you can make an informed decision for yourself.

Having said that, if you are short on time, here is your TLDR; bullet point list:

  • Don't use it! (If you are interested in exactly why, I am afraid you will need to read on)
  • If you have to use it (perhaps as a requirement imposed on you), don't do it in Python. Ideally, use TypeScript/JavaScript, since that is what the actual CDK generator is implemented in. It might mean learning another language - ouch
  • In the Summary I also look at whether there are any benefits to using the CDK, and offer a couple more perspectives on it depending on what state your actual project is currently in

Polyglot swearing in vernacular

The project we worked on was a Python one, and since all of us are more or less well versed in this awesome programming language, choosing one's "CDK flavour" was obviously a no-brainer. No matter what language one chooses to write their CDK code in, though, CDK is still a Node.js tool, so the code will automagically be interfaced to it through some sort of dark primal TypeScript/JavaScript voodoo, and when things go south after running cdk synth or cdk deploy one should be prepared to see errors like this one:

bash
Traceback (most recent call last):
  File "app.py", line 836, in <module>
    main()
  File "app.py", line 815, in main
    terminus_preview_subscriber.Infrastructure(
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_runtime.py", line 112, in __call__
    inst = super().__call__(*args, **kwargs)
  File "your/cdk/project/path/deployment/cdk/application/stack/some-cdk-stack.py", line 222, in __init__
    lambda_fn = self.define_lambda(execution_role,
  File "your/cdk/project/path/deployment/cdk/application/stack/some-cdk-stack.py", line 86, in define_lambda
    lambda_function: aws_lambda.Function = PythonFunction(
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_runtime.py", line 112, in __call__
    inst = super().__call__(*args, **kwargs)
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/aws_cdk/aws_lambda_python_alpha/__init__.py", line 1004, in __init__
    jsii.create(self.__class__, self, [scope, id, props])
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 337, in create
    args=_make_reference_for_native(self, args),
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 165, in _make_reference_for_native
    return [_make_reference_for_native(kernel, i) for i in d]
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 165, in <listcomp>
    return [_make_reference_for_native(kernel, i) for i in d]
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 182, in _make_reference_for_native
    "data": {
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 183, in <dictcomp>
    jsii_name: _make_reference_for_native(
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 205, in _make_reference_for_native
    kernel.create(d.__class__, d)
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 336, in create
    fqn=klass.__jsii_type__ or "Object",
AttributeError: type object 'tuple' has no attribute '__jsii_type__'

In this particular case it was caused by a comma (,) at the end of the statement defining the security groups list for the Lambda function's VPC settings. The inline assignment of the security_groups argument had been moved into a separate statement preceding the Lambda function initialisation block, but the comma wasn't deleted from the end of it, causing security_groups to actually become a tuple and the CDK to complain in this wordy and not quite meaningful manner:

python
from aws_cdk import aws_ec2, aws_lambda, aws_lambda_python_alpha
...
security_groups = [aws_ec2.SecurityGroup(self, "some-name", vpc=vpc_instance)],  # Notice the comma at the end of the statement!
lambda_function: aws_lambda.Function = aws_lambda_python_alpha.PythonFunction(
    scope=self,
    ...
    # VPC
    vpc=vpc_instance,
    vpc_subnets=subnet_selection if vpc_instance else None,
    security_groups=security_groups,  # The comma made its way up from here
    # Runtime
    runtime=aws_lambda.Runtime.PYTHON_3_8,
    ...
)

One might make a pointed remark that this type of issue should be caught at the coding stage by noticing the type mismatch through a static type checker, and one would be correct except for the fact that Python doesn't really have static types, and all the type specification annotations are really just that - annotations. So, while handy, type checkers such as mypy, pyright and others do a sort of "best effort" - at least in my experience. They catch some of the things described above, but they often complain about totally legitimate stuff as well - which happens quite often with the CDK's complex type system. So, while working with CDK in my neovim setup with pyright, I got used to seeing some pretty consistent "red" marks across my code, such as:

bash
Argument of type "Role" cannot be assigned to parameter "role" of type "IRole | None" in function "__init__"
  Type "Role" cannot be assigned to type "IRole | None"
    "Role" is incompatible with protocol "IRole"
      "grant_assume_role" is an incompatible type
        Type "(identity: IPrincipal) -> Grant" cannot be assigned to type "(grantee: IPrincipal) -> Grant"
          Parameter name mismatch: "grantee" versus "identity"
      "grant_pass_role" is an incompatible type
        Type "(identity: IPrincipal) -> Grant" cannot be assigned to type "(grantee: IPrincipal) -> Grant"
          Parameter name mismatch: "grantee" versus "identity"

Accordingly, it was very easy to miss one more of the same kind:

bash
Expression of type "tuple[list[SecurityGroup]]" cannot be assigned to declared type "List[SecurityGroup]"
...

Once again, the premise of CDK was to allow one to use programming languages and their best coding practices in order to declare AWS infrastructure. In the particular case of Python, there seems to be a clear abuse of its typing system to shoehorn JavaScript-style interfaces into the language. What is the point of the Role/IRole, Function/IFunction and similar type/interface separation? In my opinion it serves no useful purpose and, on top of that, has no real representation in the actual CloudFormation YAML/JSON templates - of which I shall speak in a bit. It is pure distraction and hassle when it comes to static type checks.

KISS-ing the CDK goodnight for the better

As for aws_cdk.aws_lambda_python_alpha.PythonFunction itself - it proved useful for packaging Lambdas which required third-party module dependencies to be installed prior to deployment, while remaining within the CDK framework. Its close cousin aws_cdk.aws_lambda_python_alpha.PythonLayerVersion turned out to be more troublesome, placing the contents of the installed third-party modules under the python subdirectory alongside all the local files, which didn't belong there at all. This behaviour broke one of our layers, so we had to come up with a solution.

And yes: the aws_cdk.aws_lambda_python_alpha.PythonFunction API docs page clearly states "experimental" in its description - and this is yet another point of concern, since its direct predecessor aws_cdk.aws-lambda-python.PythonFunction did not have alpha in its name in CDK version 1.x, yet shared with its offspring the same sad "experimental" state. Which, in my view, highlights another issue with CDK, covered in more detail in the next section.

Somehow, after working with CDK for a while, I found myself more and more "thinking inside the box", reaching for all those AWS libraries for "solutions" and forgetting that underneath the "template generation" phase there is just plain good old Python. "Packaging" any piece of code - be it a Lambda or a Lambda layer requiring third-party modules - can obviously be arranged by running any package installation command inside a subprocess:

python
import pathlib
import subprocess
...
cwd = str(pathlib.Path(__file__).parent.absolute())
package_command = ("docker run --rm --entrypoint /bin/bash -v "
                   + cwd
                   + "/lambda-layer/layer-name:/lambda-layer python:3.7 -c "
                   + "'pip3 install -r /lambda-layer/requirements.txt -t /lambda-layer/python'")
subprocess.run(package_command, shell=True, check=True)

The folder populated with the required libraries can then be supplied to the LayerVersion's code argument through AssetCode, like this:

python
from aws_cdk import aws_lambda
...
lambda_layer = aws_lambda.LayerVersion(
    self,
    "Name-Of-The-LambdaLayer",
    code=aws_lambda.AssetCode(
        path='stack/lambda-layer/layer-name'
    ),
    compatible_runtimes=[aws_lambda.Runtime.PYTHON_3_7],
)

We got rid of all the PythonLayerVersion and PythonFunction invocations in our code using this "hack". One might argue that the demonstrated example is an argument in favour of using the CDK; I'd counter that it is exactly the case against it, since I have always been totally free to use whatever tools, utilities and programming languages I prefer to deploy my AWS infrastructure. I could use Python, Go, or Bash to do exactly the same thing I've done above and then employ the AWS CLI, Boto3, CodeBuild with CodePipeline, or even Terraform to deploy my Lambdas and whatnot.

Moving too fast and causing much havoc

Doubtlessly the very first thing one is faced with when starting to use any new tooling is its "how to", and to be blatantly honest, CDK's is pretty much non-existent. Yes, there are all the "Getting started" resources with (arguably) useful recommendations for setting up a development environment, "examples", "patterns" and such. But once one dives deep into actual code comprising several interconnected stacks and tries to make sense of how to make them all work with each other, they are pretty much on their own. The CDK API Reference, which I believe should be the reference guide for anyone writing CDK code, looks more like a collection of notes "on the back of an envelope" - as if someone developing "yet another feature" jotted down some quick references for themselves and those "in the know" and never found the time to come back and fix them. Code examples are incomplete, with variables left uninitialised (which would at least have stated their type explicitly), there are references to code which is not provided in the example itself, some object attributes and methods are not properly described, and so on and so forth.

The pace at which new versions of CDK are being released is quite fast. Aside from all the (hopefully) continuous improvements to the codebase, the releases are very likely driven by the fact that each time a new attribute is introduced into any CloudFormation resource on the AWS side, that same attribute needs to be implemented in the CDK itself in order for the code to compile. This brings in another layer of infrastructure code management: making sure one doesn't introduce instability or regressions into one's deployments while following a quite demanding upgrade schedule.

It’s just a nice façade in front of the beaten CloudFormation

The AWS CDK guide boasts how one can get more done with fewer keystrokes. While this is achievable in some instances, it is at best arguable for others. Besides, I'd add a word of caution that it might also come with less confidence at deployment time, when the code is actually run. Ultimately, all the programming constructs inevitably end up in a CloudFormation template full of "metadata" noise and dynamically generated logical IDs, if one is using the so-called "higher level of abstraction" objects (the ones said to possess "rich" APIs) - in other words, not the "flat"/"Level 1" representations of the CloudFormation construct counterparts. Thus the templates actually become more bloated, with less human-readable references, and hence much harder to debug. And if you think this sort of "low level" debugging is not needed with CDK, you are going to be truly and strikingly disappointed: debugging you will need. At the end of the day, what gets sent to AWS CloudFormation is not the code but the CloudFormation template generated from it - hence be prepared for a lot of "runtime errors" in addition to the "compile" ones mentioned earlier. (Who asked for the "true programming infrastructure" paradigm?)

So if you thought you would part with that "YAML engineer" title for good, take heart: from now on you will need to become an expert in two things at once - the CDK ("compile time") and CloudFormation ("runtime"). Oh, and did I mention that the templates which are actually sent to AWS are JSON documents, unlike the ones displayed on your screen by the handy cdk synth command? Sure, one can always convert them back to YAML with a one-liner, but the resulting YAML will still look quite ugly, since all string concatenations in the template use the Fn::Join intrinsic function instead of the friendlier Fn::Sub. Lo and behold: a JSON template would make sense if, after writing my CDK code, I had no need to look at the end result. That is, however, absolutely not the case.
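
If you do want YAML back, a quick conversion is always within reach - here is a minimal sketch, assuming PyYAML is installed; the template filename is hypothetical:

python
import json

import yaml  # assumes PyYAML is available in the environment

# Load a synthesised JSON template and print it back as (still Fn::Join-riddled) YAML
with open("cdk.out/SomeStack.template.json") as template_file:
    template = json.load(template_file)
print(yaml.safe_dump(template, sort_keys=False))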

Nested stack CDK deployments are (almost) "black boxes"

So, it's kind of nice to have the cdk synth command to pre-synthesise the CloudFormation template and preview the stack that is going to be deployed. One needs to keep in mind, though, that this feature doesn't quite work for nested stacks: by their very nature, nested stacks are generated during the deployment of the master stack. In the case of plain CloudFormation it's all nice and clear, since the templates for the nested stacks are pre-uploaded into an S3 bucket and one knows exactly what those templates are - prior to the deployment happening. Not so in the case of CDK-generated templates. While it is possible to preview the template of the master "parent" stack using cdk synth, there are additional hoops to jump through to get one's hands on the templates of the nested stacks, and it requires some knowledge of how the cdk CLI works (or some digging into its options list) to figure it out. By default, cdk first generates and stores all templates in the cdk.out subfolder of the directory where it is run. It is therefore possible to invoke one of the "dry-run" commands - say cdk ls - and peek into that location to find the templates of the parent and all of its nested "children" prior to their "birth". Just be mindful that the filenames will be a clump of the parent stack name, the nested stack's own "ID" and some pseudo-random gobbledygook appended at the very end - in other words, not very readable indeed. Also, those filenames will not match the names of the actual nested stacks that eventually get deployed (don't ask me why) and, as I said earlier, their contents will be JSON, not YAML.

Once a nested stack is deployed, its maintenance becomes easier with the help of the useful command cdk diff <name of the master stack> - which also takes the options -e (perform a diff exclusively for the stack in question) and --context-lines <number> - as the difference between the already deployed stack and the requested update will be displayed without having to look for the template inside the cdk.out directory and perform the diff on your own. But yes, as you might have rightly guessed, it will be a JSON diff, not a YAML one.

One can obviously establish a convention of never using nested stacks with the CDK - especially taking into account the observation that nested stacks were essentially a "poor man's" attempt to parameterise complex deployments by compartmentalising them into small reusable CloudFormation templates - in other words, to make them reproducible and "programmable" to an extent, a task which CDK should now handle. Nevertheless, it seems that sometimes this goal is not attainable - at least that was our experience when trying to "un-nest" a pre-existing API Gateway stack with a custom domain setting from a larger VPC-DNS-API formation. When trying to make the API Gateway stack standalone, we kept getting a fully reproducible error:

"API Gateway could not successfully write to CloudWatch Logs using the ARN specified. Rate exceeded"

We couldn't get to the bottom of the issue. The only difference between the synthesised stacks of the standalone API Gateway without a custom domain name (which deployed fine) and with the custom domain name setting (which kept failing) was the expected AWS::Route53::RecordSet referencing the Route53 hosted zone deployed in the still-nested DNS stack.

Don’t treat it like a program - because it isn’t

It doesn't take long to get carried away when starting to work with the CDK: if you think about it, the "infrastructure code" looks like a program, feels like a program, so it must be a program, right? Well, not quite, as there are limits to it - limits which eventually form that proverbial "box" mentioned earlier. It always pays to keep all the CloudFormation restrictions in the back of one's mind: for instance, it is impossible to pass one resource's attributes to another if the stacks in which they are deployed reside in different regions (an ACM certificate for a CloudFront distribution, anyone?).

Perhaps you wanted to perform some conditional deployment based on whether certain resources are already present in AWS? Well, technically you can. Practically, though, you will need to fork your discovery code out to the AWS SDK and treat it as a separate entity: supply a separate set of credentials to authenticate your SDK calls against AWS, and then somehow pass the information back to the CDK. After all, your CDK code does not directly interact with AWS resources: it is there only to generate the template and pass it over to AWS CloudFormation for deployment.
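
For illustration, here is a minimal sketch of what that forking out might look like, assuming SDK credentials are configured separately from the CDK CLI; the bucket and construct names are hypothetical:

python
import boto3
import botocore.exceptions
from aws_cdk import Stack, aws_s3
from constructs import Construct


def bucket_exists(name: str) -> bool:
    # Discovery happens entirely outside of CDK, through the AWS SDK
    try:
        boto3.client("s3").head_bucket(Bucket=name)
        return True
    except botocore.exceptions.ClientError:
        return False


class StorageStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # The result of the SDK call is baked into the synthesised template;
        # CloudFormation never evaluates this condition at deploy time
        if not bucket_exists("some-existing-bucket"):
            aws_s3.Bucket(self, "SomeBucket", bucket_name="some-existing-bucket")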

All the CloudFormation limits on resource names are obviously still present, so in case one wants to supply their own names instead of the auto-generated ones, there is a need to keep an eye on this as well - and yep, it is not checked at "compile" time.

CDK rich API objects are a nuisance, not a simplification

Here's where things get really bad. The touted "terseness" of code which uses high-level abstraction CDK objects with "rich APIs" hits back at you when you discover that it is not easy to extract information about the resources which form part of your overall stack. Say you deployed your VPC with subnets, VPC endpoints, route tables and a NAT gateway as part of a single stack using the handy aws_ec2.Vpc object, which allows one to set everything up conveniently by passing a few parameters at the object's initialisation time. Imagine there is another stack (for instance an API Gateway) which needs to be passed the NAT gateway IPs in order to whitelist them in a resource policy, but aws_ec2.Vpc doesn't have an attribute referring to the NAT resources to get those back. Now you're in trouble. You either have to use a CustomResource Lambda that would look up the Elastic IPs of the NAT gateways upon their creation and return them to be exported as stack outputs, or go back and rewrite your VPC provisioning procedure as a set of "low-level" Cfn* objects - CfnEIP, CfnNatGateway, CfnRoute, CfnRouteTable - and export the sought-for EIPs in a simple stack output. So much for "conveniently avoiding" taking care of your basic building blocks.
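
A minimal sketch of the "low-level" route, assuming it sits inside the stack's __init__ and a public subnet ID is already at hand; the construct IDs and the export name are hypothetical:

python
from aws_cdk import CfnOutput, aws_ec2

nat_eip = aws_ec2.CfnEIP(self, "NatEip", domain="vpc")
aws_ec2.CfnNatGateway(
    self,
    "NatGateway",
    allocation_id=nat_eip.attr_allocation_id,
    subnet_id=public_subnet_id,  # assumed to be defined elsewhere in the stack
)
# With the L1 objects the Elastic IP is directly at hand and can be exported for other stacks
CfnOutput(self, "NatEipOutput", value=nat_eip.ref, export_name="nat-gateway-eip")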

But wait, there's more! Remember those logical resource IDs which are (again) "conveniently" generated by the "high-level" CDK objects with "rich APIs"? Guess what: they can suddenly change in between seemingly innocuous code refactorings. This is a now three-year-old (!) nasty CDK bug which makes the use of the objects with "rich APIs" a very dangerous (if not outright deadly) enterprise. Imagine running a deployment of your stack and suddenly realising it destroys all your perfectly functioning resources in order to deploy them anew, because their logical IDs have been updated. That happened to us, too. A workaround for Node.js is documented in one of the comments to this GitHub issue, and one of our engineers implemented a similar hack using a "reflection" of <resource>.node.default_child.override_logical_id(). Obviously, this has to be implemented consistently for all of the resources created by your CDK code to make it fail-proof, or - alternatively - one can drop the use of the "high-level" CDK objects altogether in favour of their uglier but safer and more reliable Cfn* cousins; however, that would essentially mean writing YAML in Python (or whatever other language rocks your boat in stormy CDK waters).
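
In Python, that hack boils down to something like this minimal sketch - the bucket construct and the pinned logical ID are hypothetical:

python
from aws_cdk import Stack, aws_s3
from constructs import Construct


class PinnedStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        bucket = aws_s3.Bucket(self, "ReportsBucket")
        # Reach down to the underlying L1 "default child" (a CfnBucket here)
        # and pin its logical ID so that refactoring cannot silently change it
        # (pyright will grumble that default_child may be None - see the earlier section)
        bucket.node.default_child.override_logical_id("ReportsBucket")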

Curiouser things

There are also some "curiouser" things in CDK, if I may quote "Alice in Wonderland". When bootstrapping the CDK in an AWS account - which is one of the first things to do before it is possible to deploy anything with it - there is a requirement to provide a unique string through the --qualifier option, otherwise the CDK will use the default hnb659fds, a "value that has no significance", as is casually mentioned on the linked documentation page. The bootstrapping process creates a CloudFormation stack with a few resources, such as IAM policies and roles for use with CDK deployments, an S3 bucket to store the uploaded templates and assets, and an SSM parameter /cdk-bootstrap/your-string/version whose value reflects the "template version number" from the "Template history" table on the bootstrapping documentation page (roughly corresponding to CDK version security-related changes). Coming back to the point made in the section devoted to the haste with which changes are introduced into the CDK framework, these table entries deserve some separate consideration:

16   2.69.0   Addresses Security Hub finding KMS.2.
...
18   2.80.0   Reverted changes made for version 16 as they don't work in all partitions and are not recommended.

The value of the "qualifier" string should be the same as the one set in the cdk.json settings file, as all of your CloudFormation stacks deployed through the cdk deploy command will automatically receive a BootstrapVersion parameter pointing at this SSM parameter and auto-resolved on each deployment, i.e.:

yaml
Parameters:
  BootstrapVersion:
    Type: AWS::SSM::Parameter::Value<String>
    Default: /cdk-bootstrap/<your string>/version
    Description: Version of the CDK Bootstrap resources in this environment, automatically retrieved from SSM Parameter Store. [cdk:skip]

The only place the resolved value of the parameter is used is this block of the generated CloudFormation template:

yaml
Rules:
  CheckBootstrapVersion:
    Assertions:
      - Assert:
          Fn::Not:
            - Fn::Contains:
                - - "1"
                  - "2"
                  - "3"
                  - "4"
                  - "5"
                - Ref: BootstrapVersion
        AssertDescription: CDK bootstrap stack version 6 required. Please run 'cdk bootstrap' with a recent version of the CDK CLI.

Yes, you read it right - it is just checking that the parameter supplied to the template has a value of 6 or higher. It might look fairly harmless, if pointless, but if one deploys a stack with one value of the "qualifier" string and then - for whatever reason - decides to change it for the next deployment, then even though the generated template will contain a default value for the BootstrapVersion parameter pointing at the new qualifier, the old one will still be supplied through the actual parameters of the CloudFormation stack, so the default will be ignored - unless the obscure option --no-previous-parameters is provided to cdk deploy:

--previous-parameters    Use previous values for existing parameters (you
                         must specify all parameters on every deployment if
                         this is disabled)             [boolean] [default: true]

If you think it is quite rare for someone to step on this rake lying in the tall grass, just search the internet with your favourite indexing engine for the following phrase:

Unable to fetch parameters /cdk-bootstrap/hnb659fds/version

And yes, we also got caught in this trap! There is a way to get rid of this ugly and unnecessary appendage to the template: provide your own "stack synthesizer" in each of your stack definitions and assign the property generate_bootstrap_version_rule the boolean value False, as described on the same bootstrapping documentation page under the "Customizing synthesis" section. Cheers to the "simplicity and convenience" of your CDK code!
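
A minimal sketch of that synthesizer tweak, assuming CDK v2; the stack name and qualifier are hypothetical:

python
import aws_cdk as cdk

app = cdk.App()
cdk.Stack(
    app,
    "SomeStack",
    synthesizer=cdk.DefaultStackSynthesizer(
        qualifier="your-string",
        # skip generating the CheckBootstrapVersion rule in the synthesised template
        generate_bootstrap_version_rule=False,
    ),
)
app.synth()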

I hope that by this point it has become clear that the wayfarers of the CDK yellow brick road, on their quest to rid themselves of the "ugly YAML CloudFormation templates" and cherishing a hope of reaching the ever-vanishing horizon of fully programmable and testable infrastructure, will find themselves in grave need of learning:

  1. The CDK framework
  2. CloudFormation (including both the YAML- and JSON-flavoured templates)
  3. And the inner workings of the CDK itself

Oh, and be on constant watch for any breaking and unexpected changes.

Are there any roses in these boonies of thorns?

There obviously are. CDK does have some really nice applications - say, if one needs to deploy a few CloudWatch alarms or dashboards for resources which have been instantiated with the "rich API" objects: as can be readily seen from the Lambda CloudWatch dashboard example, it is a matter of calling a metric_<name> method to add a particular widget to the dashboard. It is certainly shorter, easier and less error-prone than having to provide all the metric parameters such as the namespace, the metric name and its dimensions, along with all the other optional arguments. Writing a step function is also a much nicer experience with blocks of code that can be reused and parameterised, rather than having to repeat oneself in YAML with a fairly obscure syntax on top.
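
To illustrate the kind of terseness meant here - a minimal sketch, assuming it lives inside a stack's __init__ with an existing aws_lambda.Function named fn; the dashboard and widget titles are hypothetical:

python
from aws_cdk import aws_cloudwatch

dashboard = aws_cloudwatch.Dashboard(self, "ServiceDashboard")
dashboard.add_widgets(
    aws_cloudwatch.GraphWidget(
        title="Lambda errors and invocations",
        # The "rich API" metric_* helpers fill in the namespace, metric name and dimensions
        left=[fn.metric_errors(), fn.metric_invocations()],
    )
)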

To summarise: for the deployment of any set of similarly configured resources - in other words, for repetitious, cookie-cutter deployments - CDK is going to have the upper hand, which is obviously expected. Is it worth all the trouble, though? Well, it's up to you to decide. If I were in a position to advise someone on whether to use the CDK or not for their next project, I'd come at it from a couple of different perspectives:

  1. Are you trying to transition from "plain CloudFormation" to CDK?
  2. Or is it a new project where no decision has yet been made on the infrastructure-as-code framework?

In the first situation I would suggest writing down all the problems you currently experience with CloudFormation and contrasting them with those of the CDK outlined in this blog post. You might find yourself reconsidering the initial excitement. In the second set of circumstances, I would consider some worthy alternatives, such as Terraform, Pulumi or even - yes, again - the old but tried-and-true CloudFormation. It will all depend on your team's composition, skills and experience. Quoting a piece of wise advice on financial matters I received some time ago, which I believe holds true for most endeavours: settle on the strategy that brings you the most benefit, in the least amount of time, with the lowest level of aggravation.