The road to a successful and easily reproducible AWS infrastructure deployment is riddled with pitfalls, and you don’t want to introduce more of them by adopting an abstraction layer that brings even more pain and tears.
CDK arrived with a set of promises: developing infrastructure code in a "real" programming language of one’s choice, deploying it without having to step outside the development framework, and previewing the changes prior to their actual deployment. Yet it has hardly delivered even one of them to its logical completion.
This kind of preamble might sound like yet another whining blog post from a grouchy DevOps person who perhaps "doesn’t want to learn", but hear me out: I did try to learn, and I learned a lot, and I wish to share exactly what I have learned during our team’s honest plunge into the world of CDK, so that you can make an informed decision for yourself.
Having said that, if you are short on time, here is your TL;DR bullet point list:
The project we worked on was a Python one, and since all of us are more or less well versed in this awesome programming language, choosing one’s "CDK flavour" was obviously a no-brainer. No matter what language one chooses to write their CDK code in, though, CDK is still a Node.js tool, so the code will automagically be interfaced to it through some sort of dark primal TypeScript/JavaScript voodoo, and when things go south after running `cdk synth` or `cdk deploy`, one should be prepared to see errors like this one:
```bash
Traceback (most recent call last):
  File "app.py", line 836, in <module>
    main()
  File "app.py", line 815, in main
    terminus_preview_subscriber.Infrastructure(
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_runtime.py", line 112, in __call__
    inst = super().__call__(*args, **kwargs)
  File "your/cdk/project/path/deployment/cdk/application/stack/some-cdk-stack.py", line 222, in __init__
    lambda_fn = self.define_lambda(execution_role,
  File "your/cdk/project/path/deployment/cdk/application/stack/some-cdk-stack.py", line 86, in define_lambda
    lambda_function: aws_lambda.Function = PythonFunction(
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_runtime.py", line 112, in __call__
    inst = super().__call__(*args, **kwargs)
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/aws_cdk/aws_lambda_python_alpha/__init__.py", line 1004, in __init__
    jsii.create(self.__class__, self, [scope, id, props])
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 337, in create
    args=_make_reference_for_native(self, args),
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 165, in _make_reference_for_native
    return [_make_reference_for_native(kernel, i) for i in d]
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 165, in <listcomp>
    return [_make_reference_for_native(kernel, i) for i in d]
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 182, in _make_reference_for_native
    "data": {
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 183, in <dictcomp>
    jsii_name: _make_reference_for_native(
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 205, in _make_reference_for_native
    kernel.create(d.__class__, d)
  File "your/cdk/project/path/bin/python/venv/lib/python3.8/site-packages/jsii/_kernel/__init__.py", line 336, in create
    fqn=klass.__jsii_type__ or "Object",
AttributeError: type object 'tuple' has no attribute '__jsii_type__'
```
In this particular case it was caused by a comma (`,`) at the end of the statement defining the security groups list for the Lambda function’s VPC settings. The inline assignment of the `security_groups` argument had been moved into a separate statement preceding the Lambda function initialisation block, and the comma wasn’t deleted from the end of it, causing `security_groups` to actually become a tuple and CDK to complain in this wordy and not quite meaningful manner:
```python
from aws_cdk import aws_ec2, aws_lambda, aws_lambda_python_alpha

...

security_groups = [
    aws_ec2.SecurityGroup(self, "some-name", vpc=vpc_instance)
],  # Notice the comma at the end of the statement!

lambda_function: aws_lambda.Function = aws_lambda_python_alpha.PythonFunction(
    scope=self,
    ...
    # VPC
    vpc=vpc_instance,
    vpc_subnets=subnet_selection if vpc_instance else None,
    security_groups=security_groups,  # The comma made its way up from here
    # Runtime
    runtime=aws_lambda.Runtime.PYTHON_3_8,
    ...
)
```
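The fix, for completeness, was a one-character affair - dropping the stray comma so that `security_groups` stays a plain list:

```python
security_groups = [
    aws_ec2.SecurityGroup(self, "some-name", vpc=vpc_instance)
]  # No trailing comma: security_groups is a list again, not a tuple
```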
One might make a pointy remark that this type of issue should be caught at the coding stage by noticing the type mismatch through a static type checker, and would be correct, except for the fact that Python doesn’t really have static types and all the type specification annotations are really just that - annotations. So while handy, type checkers such as `mypy`, `pyright` and others do a sort of "best effort" - at least in my experience. They catch some of the things described above, but they often complain about totally legit stuff as well - which happens quite often with CDK’s complex type system. So while working with CDK in my `neovim` setup with `pyright`, I got used to seeing some pretty consistent "red" marks across my code, such as:
```bash
Argument of type "Role" cannot be assigned to parameter "role" of type "IRole | None" in function "__init__"
  Type "Role" cannot be assigned to type "IRole | None"
    "Role" is incompatible with protocol "IRole"
      "grant_assume_role" is an incompatible type
        Type "(identity: IPrincipal) -> Grant" cannot be assigned to type "(grantee: IPrincipal) -> Grant"
          Parameter name mismatch: "grantee" versus "identity"
      "grant_pass_role" is an incompatible type
        Type "(identity: IPrincipal) -> Grant" cannot be assigned to type "(grantee: IPrincipal) -> Grant"
          Parameter name mismatch: "grantee" versus "identity"
```
Accordingly, it was very easy to miss one more error of this kind:
```bash
Expression of type "tuple[list[SecurityGroup]]" cannot be assigned to declared type "List[SecurityGroup]"
...
```
Once again, the premise of CDK was to allow one to use programming languages and their best coding practices to declare AWS infrastructure. In the particular case of Python there seems to be a clear abuse of its typing system to shoehorn JavaScript-style interfaces into the language. What is the point of the `Role`/`IRole`, `Function`/`IFunction` and similar type/interface separations? In my opinion it serves no useful purpose and, on top of that, has no real representation in the actual CloudFormation YAML/JSON templates - of which I shall speak in a bit. It is just a pure distraction and hassle when it comes to static type checks.
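To make the point concrete, here is a minimal sketch (the construct names and the surrounding stack class are hypothetical) of perfectly working CDK code that earns those "red marks":

```python
from aws_cdk import aws_iam, aws_lambda

# Role implements the IRole interface and this code synthesises and
# deploys just fine - yet pyright flags the `role` argument, because
# the jsii-generated stubs disagree on a parameter name inside the
# protocol ("grantee" versus "identity", as seen above).
execution_role = aws_iam.Role(
    self,
    "execution-role",
    assumed_by=aws_iam.ServicePrincipal("lambda.amazonaws.com"),
)
lambda_function = aws_lambda.Function(
    self,
    "some-function",
    runtime=aws_lambda.Runtime.PYTHON_3_8,
    handler="index.handler",
    code=aws_lambda.Code.from_asset("lambda"),
    role=execution_role,  # "Role" is incompatible with protocol "IRole"
)
```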
As for `aws_cdk.aws_lambda_python_alpha.PythonFunction` itself - it proved useful for packaging Lambdas which required the installation of third-party module dependencies prior to deployment, while remaining within the CDK framework. Its close cousin `aws_cdk.aws_lambda_python_alpha.PythonLayerVersion` turned out to be more troublesome, placing the contents of the installed third-party modules under the `python` subdirectory alongside all the local files, which didn’t belong there at all. This behaviour broke one of our layers, so we had to come up with a solution. And yes: the `aws_cdk.aws_lambda_python_alpha.PythonFunction` API docs page clearly states "experimental" in its description - and this is yet another point of concern, since its direct predecessor `aws_cdk.aws_lambda_python.PythonFunction` did not have `alpha` in its name in CDK version 1.x, but shared with its offspring the same sad "experimental" state. Which, in my view, highlights another issue with CDK, covered in more detail in the next section.
Somehow, after working with CDK for a while, I found myself more and more "thinking inside the box", reaching out to all those AWS libraries for "solutions" and forgetting that underlying the "template generation" phase is just plain good old Python. "Packaging" any piece of code, be it a Lambda or a Lambda layer requiring third-party modules, can obviously be arranged by running any package installation command inside a `subprocess`:
```python
import pathlib, subprocess

...

cwd = str(pathlib.Path(__file__).parent.absolute())
package_command = (
    "docker run --rm --entrypoint /bin/bash -v "
    + cwd
    + "/lambda-layer/layer-name:/lambda-layer python:3.7 -c "
    + "'pip3 install -r /lambda-layer/requirements.txt -t /lambda-layer/python'"
)
subprocess.run(package_command, shell=True, check=True)
```
And then the folder that has been populated with the required libraries can be supplied to the `LayerVersion`’s `code` argument through `AssetCode`, like this:
```python
from aws_cdk import aws_lambda

...

lambda_layer = aws_lambda.LayerVersion(
    self,
    f"Name-Of-The--LambdaLayer",
    code=aws_lambda.AssetCode(path='stack/lambda-layer/layer-name'),
    compatible_runtimes=[aws_lambda.Runtime.PYTHON_3_7],
)
```
We got rid of all the `PythonLayerVersion` and `PythonFunction` invocations in our code using this "hack". One might argue that the demonstrated example serves as an argument in favour of using CDK; I’d counter that it is exactly the case against it, as I have always been totally free to use whatever tools, utilities and programming languages I prefer to deploy my AWS infrastructure. I could use Python, Go, or Bash to do exactly the same thing I’ve done above and then employ AWS CLI, Boto3, CodeBuild with CodePipeline or even Terraform to deploy my Lambdas and whatnot.
Doubtlessly, the very first thing one is faced with when starting to use any new tooling is its "how to", and to be blatantly honest, CDK’s is pretty much non-existent. Yes, there are all the "Getting started" resources with (arguably) useful recommendations around setting up a development environment, "examples", "patterns" and such. But once one dives deep into actual code comprising several interconnected stacks and tries to make sense of how to make them all work with each other - they are pretty much on their own. The CDK API Reference, which I believe should be the reference guide for anyone writing CDK code, looks more like a collection of notes "on the back of an envelope" - as if someone developing "yet another feature" jotted down some quick references for themselves and those "in the know", and never had time to come back and fix them. Code examples are incomplete, variables are left uninitialised (so that their types would at least be explicitly stated), there are references to code which is not provided in the example itself, some object attributes and methods are not properly described, and so on and so forth.
The pace at which new versions of CDK are released is quite fast. Aside from all the (hopefully) continuous improvements to the codebase, the releases are very likely driven by the fact that each time a new attribute is introduced into any of the CloudFormation resources on the AWS side, that same attribute needs to be implemented in CDK itself in order for the code to compile. This brings in another layer of infrastructure code management: making sure one doesn’t introduce instability or regression into one’s deployments while following quite a demanding upgrade schedule.
The AWS CDK guide boasts how one can get more done with fewer keystrokes. I say that while it is achievable in some instances, it is at best arguable for others. Besides, I’d add a word of caution that it might also come with less confidence at deployment time, when the code is run. Ultimately, all the programming constructs inevitably end up in a CloudFormation template with a lot of "metadata" noise and dynamically generated logical IDs - if one is using the so-called "higher level of abstraction" objects (the ones said to possess "rich" APIs), in other words, not the "flat"/"Level 1" representations of their CloudFormation construct counterparts. Thus the templates actually become more bloated, with less human-readable references, and hence much harder to debug. And if you think this sort of "low level" debugging is not needed with CDK, you are going to be truly and strikingly disappointed: debugging you will need. At the end of the day, what gets sent to AWS CloudFormation is not the code but the CloudFormation template generated by it - hence be prepared for a lot of "runtime errors" in addition to the "compile" ones mentioned earlier. (Who asked for the "true programming infrastructure" paradigm?)
So if you thought to part with that "YAML engineer" title for good, take heart: from now on you will need to become an expert in two things at once, the CDK ("compile" time) and CloudFormation (runtime). Oh, and did I mention that the templates which are actually sent to AWS are JSON documents, unlike the ones displayed on your screen by the handy `cdk synth` command? Sure, one can always convert them back to YAML with a one-liner (see the sketch after this paragraph); however, the resulting YAML will still look quite ugly, since all string concatenations in the template use the `Fn::Join` intrinsic function instead of the friendlier `Fn::Sub`. Lo and behold: a JSON template would make sense if, after writing my CDK code, I did not need to look at the end result. That, however, is absolutely not the case.
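For the record, such a conversion needs nothing more exotic than a few lines of Python - a sketch assuming PyYAML is installed, with a hypothetical script name of `to_yaml.py`:

```python
import json
import sys

import yaml  # PyYAML, assumed to be installed separately

# Convert a synthesised JSON template back into (still ugly) YAML:
with open(sys.argv[1]) as template:
    print(yaml.safe_dump(json.load(template), sort_keys=False))
```

Run it as, e.g., `python to_yaml.py cdk.out/MyStack.template.json`.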
So, it’s kind of nice to have the `cdk synth` command to pre-synthesise the CloudFormation template in order to preview the stack that is going to be deployed. One needs to keep in mind, though, that this feature doesn’t quite work for nested stacks: by their very nature, nested stacks are generated during the deployment of the master stack. In the case of plain CloudFormation it’s all nice and clear, since the templates for the nested stacks are pre-uploaded into an S3 bucket and one knows exactly what their templates are - prior to the deployment happening. Not so in the case of the CDK-generated templates. While it is possible to preview the template of the master "parent" stack using `cdk synth`, there are some additional hoops to jump through in order to get one’s hands on the templates of the nested stacks, and it requires some knowledge of how the `cdk` CLI works (or some digging into its options list) to figure it out. By default, `cdk` first generates and stores all templates in the `cdk.out` subfolder of the directory where it is run. Therefore it is possible to invoke one of the "dry-run" commands - say, `cdk ls` - and peek into that location to find the templates of the parent and all of its nested "children" prior to their "birth" - just be mindful that the filenames will be a clump of the name of the parent stack, their own "ID" and some pseudo-random gobbledygook appended at the very end of the name. In other words, not very readable indeed. Also, those filenames will not be the same as the names of the actual nested stacks that are eventually going to be deployed (don’t ask me why) and, as I said earlier, their contents will be JSON, not YAML.
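A small sketch of that "peek" into `cdk.out` (assuming the usual cloud assembly layout, where nested stack templates carry a `.nested.template.json` suffix):

```python
from pathlib import Path

# List everything cdk has synthesised into cdk.out, flagging the
# nested stack templates by their filename suffix:
for template in sorted(Path("cdk.out").glob("*.template.json")):
    kind = "nested" if template.name.endswith(".nested.template.json") else "parent"
    print(f"{kind:>6}: {template.name}")
```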
Once a nested stack is deployed, its maintenance becomes easier with the help of the useful command `cdk diff <name of the master stack>` - which also takes the options `-e` (perform a diff exclusively for the stack in question) and `--context-lines <number>` - as the difference between the already deployed stack and the requested update will be displayed without having to look for that template inside the `cdk.out` directory and performing the diff on your own. But yes, as you might have rightly guessed, it will be a JSON diff, not a YAML one.
One can obviously establish a convention of never using nested stacks with CDK - especially taking into account the observation that nested stacks were essentially a "poor man’s" attempt to parameterise complex deployments by compartmentalising them into small reusable CloudFormation templates - in other words, to make them reproducible and "programmable" to an extent - a task which CDK should now deal with. Nevertheless, it seems that sometimes this goal is not attainable - at least that was our experience when trying to "un-nest" a pre-existing API Gateway stack with a custom domain setting from a larger VPC-DNS-API formation. When trying to make the API Gateway stack standalone, we were getting a fully reproducible error:
"API Gateway could not successfully write to CloudWatch Logs using the ARN specified. Rate exceeded"
And we couldn’t get to the bottom of the issue. The only difference between the synthesised stacks of the standalone API Gateway without a custom domain name (which deployed alright) and with the custom domain name setting (which was failing) was the obvious `AWS::Route53::RecordSet` with a reference to the Route53 hosted zone deployed in the still-nested DNS stack.
In the beginning, it doesn’t take long to get carried away when starting to work with CDK: if you think of it, the "infrastructure code" looks like a program and feels like a program, so it must be a program, right? Well, not quite, as there are some limits to it - which eventually form that proverbial "box" mentioned earlier. It always pays to keep all the CloudFormation restrictions in the back of one’s mind: for instance, it is impossible to pass one resource’s attributes to another if the stacks in which they are deployed reside in different regions (an ACM certificate for a CloudFront distribution, anyone?).
Perhaps you wanted to perform some conditional deployment based on whether some of the resources are already present in AWS? Well, technically - you can. Practically, though, you will need to fork out your discovery code to the AWS SDK and treat it like a separate entity: supply a separate set of credentials to authenticate your SDK calls against AWS, and then somehow pass the information back to CDK. After all, your CDK code does not directly interact with AWS resources: it is there only to generate the template and pass it over to AWS CloudFormation for deployment.
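A sketch of that dance (the VPC tag and the context key are made up for illustration): discover the state with Boto3 first, then feed the verdict back into the CDK app, for example via context:

```python
import boto3
from aws_cdk import App

# Step 1: "discovery" through the SDK - authenticated on its own,
# entirely outside of anything the CDK CLI knows or cares about.
ec2 = boto3.client("ec2")
vpc_exists = bool(
    ec2.describe_vpcs(
        Filters=[{"Name": "tag:Name", "Values": ["my-existing-vpc"]}]
    )["Vpcs"]
)

# Step 2: pass the verdict back to CDK as a context value that the
# stacks can branch on at synthesis time.
app = App(context={"vpc_exists": vpc_exists})
```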
All the CloudFormation limits on resource names are obviously still present, so in case one wants to supply their own names instead of the auto-generated ones, there is a need to keep an eye on this as well - and yep, it is not checked at "compile" time.
Here’s where things get really bad. The touted "terseness" of code which uses high-level abstraction CDK objects with "rich APIs" hits back at you when you discover that it is not possible to easily extract some information about the resources which form part of your overall stack. Say you deployed your VPC with subnets, VPC endpoints, route tables and a NAT gateway as part of a single stack using the handy `aws_ec2.Vpc` object, which allows one to set everything up conveniently by passing a few parameters at the object’s initialisation time. Imagine there is another stack (for instance, an API Gateway) and there is a need to pass it the NAT gateway IPs in order to whitelist them in the resource policy, but `aws_ec2.Vpc` doesn’t have an attribute referring to the NAT resources to get those back. Now you’re in trouble. You either have to use a CustomResource Lambda that would look up the Elastic IPs of the NAT gateways upon their creation and return them back to be exported as stack outputs, or go back and rewrite your VPC provisioning procedure as a set of the "low-level" `Cfn*` objects - `CfnEIP`, `CfnNatGateway`, `CfnRoute`, `CfnRouteTable` - and have your sought-after EIPs exported in a simple stack output. So much for "conveniently avoiding" taking care of your basic building blocks.
But wait, there’s more! Remember those logical resource IDs which are (again) "conveniently" generated by the "high-level" CDK objects with "rich APIs"? Guess what? They can suddenly change between seemingly innocuous code refactorings. This is a now three-year-old (!) nasty CDK bug which makes the use of the objects with "rich APIs" a very dangerous (if not outright deadly) enterprise. Imagine running a deployment of your stack and suddenly realising it destroys all your perfectly functioning resources in order to deploy them anew, due to their logical IDs being updated. That happened to us, too. A workaround for Node.js is documented in one of the comments to this GitHub issue, and one of our engineers implemented a similar hack using a "reflection" of `<resource>.node.default_child.override_logical_id()` (see the sketch below). Obviously, this has to be implemented consistently for all of the resources created by your CDK code to make it fail-proof, or - alternatively - one can drop the use of the "high level" CDK objects altogether in favour of their uglier but safer and more reliable `Cfn*` cousins; however, that would essentially mean writing YAML in Python (or whatever other language rocks your boat in stormy CDK waters).
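In Python, the hack boils down to reaching for the underlying `Cfn*` resource and pinning its logical ID by hand - a minimal sketch with a hypothetical bucket:

```python
from aws_cdk import aws_s3

bucket = aws_s3.Bucket(self, "my-bucket")

# node.default_child is the low-level CfnBucket hiding behind the "rich
# API" object; pinning its logical ID protects the resource from being
# replaced after a seemingly innocuous refactoring. (And yes, pyright
# will complain here too, since default_child is typed as Optional.)
cfn_bucket = bucket.node.default_child
cfn_bucket.override_logical_id("MyBucketStableLogicalId")
```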
There are also some "curiouser" things in CDK, if I may quote "Alice in Wonderland". When bootstrapping CDK in the AWS account - which is one of the first things to do before it is possible to deploy anything with it - there is a requirement to provide a unique string through the `--qualifier` option, otherwise CDK will use the default `hnb659fds`, the "value that has no significance", as is casually mentioned on the linked documentation page. The bootstrapping process creates a CloudFormation stack with a few resources, such as IAM policies and roles for use with CDK deployments, an S3 bucket to store the uploaded templates and assets, and an SSM parameter `/cdk-bootstrap/your-string/version` whose value reflects the "template version number" from the "Template history" table on the bootstrapping documentation page (roughly corresponding to security-related changes in CDK versions). Coming back to the point made earlier about the haste with which changes are introduced into the CDK framework, these table entries deserve some separate consideration:
| Template version | AWS CDK version | Changes |
| --- | --- | --- |
| 16 | 2.69.0 | Addresses Security Hub finding KMS.2. |
| ... | ... | ... |
| 18 | 2.80.0 | Reverted changes made for version 16 as they don't work in all partitions and are not recommended. |
The value of the "qualifier" string should be the same as the one set in the `cdk.json` settings file, as all of your CloudFormation stacks deployed through the `cdk deploy` command will automatically receive a parameter `BootstrapVersion` that points at this SSM parameter and auto-resolves on each deployment, i.e.:
```yaml
Parameters:
  BootstrapVersion:
    Type: AWS::SSM::Parameter::Value<String>
    Default: /cdk-bootstrap/<your string>/version
    Description: Version of the CDK Bootstrap resources in this environment, automatically retrieved from SSM Parameter Store. [cdk:skip]
```
The only place the resolved value of this parameter is used is this block of the generated CloudFormation template:
```yaml
Rules:
  CheckBootstrapVersion:
    Assertions:
      - Assert:
          Fn::Not:
            - Fn::Contains:
                - - "1"
                  - "2"
                  - "3"
                  - "4"
                  - "5"
                - Ref: BootstrapVersion
        AssertDescription: CDK bootstrap stack version 6 required. Please run 'cdk bootstrap' with a recent version of the CDK CLI.
```
Yes, you read it right - it is just checking that the parameter supplied to the template has a value greater than or equal to `6`. It might look fairly harmless, if pointless, but if one deployed a stack with one value of the "qualifier" string and then - for whatever reason - decided to change it on the next deployment, then even though the generated template will contain a default value for the `BootstrapVersion` parameter set to the new value, the old one will still be provided through the actual parameters to the CloudFormation stack, so the default will be ignored - unless the obscure option `--no-previous-parameters` is provided to `cdk deploy`:
```
--previous-parameters  Use previous values for existing parameters (you
                       must specify all parameters on every deployment if
                       this is disabled)          [boolean] [default: true]
```
If you think it’s quite rare that someone will ever step on this rake lying in the high grass, just perform an internet search with your favourite indexing engine for the following phrase:
Unable to fetch parameters /cdk-bootstrap/hnb659fds/version
And yes, we also got caught in this trap! There is a way to get rid of this ugly and unnecessary appendage to the template by providing your own "stack synthesizer" in each of your stack definitions and assigning the property `generate_bootstrap_version_rule` the boolean value of `False`, as described under the "Customizing synthesis" section of the same documentation page about bootstrapping. Cheers to the "simplicity and convenience" of your CDK code!
I hope that at this point it has become clear that wayfarers of the CDK yellow brick road, on the quest to rid themselves of the "ugly YAML CloudFormation templates" and cherishing a hope of reaching the ever-vanishing horizons of fully programmable and testable infrastructure, will find themselves in grave need of learning:
Oh, and be on constant watch-out for any breaking and unexpected changes. There obviously are. CDK does have some really nice applications: say, if one needs to deploy a few CloudWatch alarms or dashboards for resources which have been instantiated with the "rich API" objects, then - as can readily be seen from the Lambda CloudWatch dashboard example - it is a matter of calling a `metric_<name>` method to add a particular widget to the dashboard (see the sketch after this paragraph). It is certainly shorter, easier and less error-prone than having to provide all the metric parameters, such as the namespace, the name of the metric and its dimensions, along with all the other optional arguments. Writing a step function is also a much nicer experience with blocks of code which can be reused and parameterised, rather than having to repeat oneself in YAML with a fairly obscure syntax on top.
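For instance, a sketch of such a dashboard widget (assuming an existing `lambda_function` from the earlier examples):

```python
from aws_cdk import aws_cloudwatch

dashboard = aws_cloudwatch.Dashboard(self, "lambda-dashboard")
dashboard.add_widgets(
    aws_cloudwatch.GraphWidget(
        title="Invocations",
        # metric_invocations() spares us the namespace, metric name
        # and dimensions boilerplate:
        left=[lambda_function.metric_invocations()],
    )
)
```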
To summarise: for the deployment of any set of similarly configured resources - in other words, for repetitious, cookie-cutter deployments - CDK is going to have the upper hand, which is to be expected. Is it worth all the trouble, though? Well, it’s up to you to decide. If I were in a position to advise someone on whether or not to use CDK for their next project, I’d approach it from a couple of different perspectives:
In the first situation, I would suggest writing down all the problems you currently experience with CloudFormation and contrasting them with those of CDK outlined in this blog post. It might happen that you will want to reconsider the initial excitement. For the second set of circumstances, I would consider some worthy alternatives, such as Terraform, Pulumi or even - yeah - again, the old but tried CloudFormation. It will all depend on your team’s composition, skills and experience. Quoting a piece of wise advice on financial matters I received some time ago, which I believe holds true for most endeavours: settle on a strategy that brings you the most benefits, in the least amount of time, with the lowest level of aggravation.