
Troubleshooting Depot CDK

Sometimes CDK deployments go wrong. This can happen for many reasons: for example, an AWS account limit has been reached (e.g. the maximum number of VPCs), your CDK code is misconfigured, or there is a resource access issue.

When these issues happen and the underlying reason is not surfaced in your cdk deploy output, it helps to know where to look for logs and hints.

First, review the Logs and Troubleshooting section for the initial steps of your investigation. Then read on below for further hints and tips.

Deployment failure in Depot Shared Environments

Depot 'shared' environments are those which host multiple Datasets as a part of a broader service. Datasets are deployed using nested CloudFormation stacks, which are typically deployed by individually managed 'service' CloudFormation stacks. The Depot deployment process automatically manages the nested stacks when you define Depot resources in your own CloudFormation stacks (typically managed with AWS CDK).

Things to avoid

Important: Never skip deletion of Depot resources when deleting your service stacks. (CloudFormation may offer this option in a prompt when a previous update or delete has failed.) If the deletion failed, try again without skipping resources, in case your deletion was bundled with another failed deployment. If it still fails, contact your Depot platform team for assistance. If you skip the deletion of Depot resources, you may end up with a broken stack: re-deploying the same or a similar stack will likely fail, and manual intervention will be required anyway.

General deployment notices and information

Failures and slow deployments: for (legacy) performance reasons, Depot can bundle multiple service deployments into a single attempt. If one of them fails, the whole bundle fails, so even if your deployment was fine, it might fail because another service on the shared environment failed its deployment. If your deployment logs show failures about another service, dataset, or Depot resource, retry your deployment and hopefully it will succeed the next time. Be reasonable, though: unless you are unlucky enough to be bundled several times in a row, do not keep retrying. Ask your Depot platform team for assistance instead. This scenario can occur when an environment is shared by multiple services and is busy with multiple teams working in parallel.
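The "retry, but not indefinitely" advice above can be sketched as a small wrapper. This is a minimal illustration, not part of Depot or the CDK: the function names and the default of three attempts are assumptions, and a retry only makes sense when the logs point at another service's failure rather than your own.

```python
import subprocess
from typing import Callable, Sequence


def deploy_with_retries(run_deploy: Callable[[], int], max_attempts: int = 3) -> bool:
    """Call run_deploy until it returns exit code 0 or max_attempts is reached."""
    for attempt in range(1, max_attempts + 1):
        exit_code = run_deploy()
        if exit_code == 0:
            return True
        print(f"Attempt {attempt}/{max_attempts} failed with exit code {exit_code}")
    # Give up here and ask the Depot platform team rather than retrying forever.
    return False


def run_cdk_deploy(args: Sequence[str] = ("cdk", "deploy")) -> int:
    """Run the CDK CLI and return its exit code (assumes `cdk` is on PATH)."""
    return subprocess.run(list(args), check=False).returncode
```

Calling deploy_with_retries(run_cdk_deploy) would run the deployment at most three times. Check the logs between attempts: if the failure is in your own resources, retrying will not help.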

Shared environments can become quite large (many datasets, locations, etc.). At that size, a build initiated by the Depot Bootstrap process (which deploys your environment changes) can take 15-20 minutes to complete. If another team pushes their changes just before yours, you might be 'stuck' in a queue behind their deployment. Your deployment could therefore take, for example: (in-flight Depot deployment) + 20 (your Depot deployment) + x (your stack's non-Depot resources) minutes.

Also note that if your branching setup re-uses resources from the main / master service branch deployments (such as shared Executors), you should never try to delete the master service Depot resources without first removing all branch deployments: Depot will not allow deleting resources that are still referenced.
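As a worked example of the estimate above, here is a small sketch. The function and parameter names are hypothetical, and the 20-minute default is simply the upper end of the 15-20 minute range quoted above.

```python
def estimate_total_minutes(
    in_flight_remaining_minutes: float,
    own_depot_minutes: float = 20,
    other_resources_minutes: float = 0,
) -> float:
    """Sum the queue wait, your own Depot deployment, and your non-Depot resources."""
    return in_flight_remaining_minutes + own_depot_minutes + other_resources_minutes


# Another team's deployment has about 15 minutes left, and your stack's
# non-Depot resources take about 5 minutes: 15 + 20 + 5 = 40 minutes.
total = estimate_total_minutes(15, other_resources_minutes=5)
```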

Pointers when looking for resources referenced by deployment logs (executors, datasets, locations, etc...)

Just like with other CloudFormation resources (e.g. Lambda or Step Function names), Depot does not allow duplicate names across executors, locations, or datasets. You may see an error similar to the following (it applies to any resource type, such as datasets, locations, or executors):

Duplicate name 'XXX' for executors abc123 and def456

This means you are trying to deploy another resource with the same name. The executor/location/dataset ID comes from the physical ID of the corresponding CloudFormation resource (look in your stack's CloudFormation Resources list).

The duplication can happen for several reasons:
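To map the two IDs in such an error back to CloudFormation resources, something like the following sketch could help. It assumes the message has exactly the shape shown above and that IDs contain no whitespace. The resource dictionaries use the keys boto3 returns from list_stack_resources (LogicalResourceId, PhysicalResourceId, ResourceType). The helper names are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class DuplicateName(NamedTuple):
    name: str
    resource_type: str
    first_id: str
    second_id: str


# Assumed message shape: Duplicate name 'XXX' for executors abc123 and def456
_DUPLICATE_RE = re.compile(
    r"Duplicate name '(?P<name>[^']*)' for (?P<rtype>\w+) "
    r"(?P<first>\S+) and (?P<second>\S+)"
)


def parse_duplicate_name(message: str) -> Optional[DuplicateName]:
    """Extract the name, resource type, and both IDs, or None if no match."""
    match = _DUPLICATE_RE.search(message)
    if match is None:
        return None
    return DuplicateName(match["name"], match["rtype"], match["first"], match["second"])


def resources_with_physical_ids(resources: Iterable[dict], ids: Iterable[str]) -> List[dict]:
    """Keep the stack resources whose PhysicalResourceId is one of the given IDs."""
    wanted = set(ids)
    return [r for r in resources if r.get("PhysicalResourceId") in wanted]
```

If only one of the two IDs appears in your stack's resource list, the other one belongs to a different stack or to a dangling resource.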

  • The previous stack has been deleted but dangling resources remain (e.g. due to skipping deletion of Depot resources).
  • The previous stack failed to create on initial deployment and dangling resources remain (known bugs https://onstage.atlassian.net/browse/DPT-2280, https://onstage.atlassian.net/browse/DPT-1970).
  • Depot resources have moved within the stack (e.g. refactoring of service CDK code results in new logical nesting) - this will cause the physical ID to change.
  • Misconfigured deployment (e.g. branch stack deployment does not qualify Depot resource names with branch name for uniqueness).

These cases can be identified by looking at CloudFormation events of the stacks - use physical IDs (abc123 and def456) to pinpoint the CloudFormation resources. Cases 1-2 need to be cleaned up manually (contact your Depot platform team). Cases 3-4 can be resolved by the service team, e.g. by renaming the resources (just like you would do for Lambda/Step Function names) or undoing the refactoring.
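For the misconfigured-deployment case, one way to keep branch deployments from colliding with the main deployment is to derive each Depot resource name from the branch. The helper below is a hypothetical sketch. The separator, the lower-casing, and the set of main branch names are assumptions, and Depot's actual naming constraints may differ.

```python
import re
from typing import Sequence


def qualified_name(
    base_name: str,
    branch: str,
    main_branches: Sequence[str] = ("main", "master"),
) -> str:
    """Return base_name unchanged on a main branch, else suffix it with a branch slug."""
    if branch in main_branches:
        return base_name
    # Collapse anything that is not a letter or digit into single hyphens.
    slug = re.sub(r"[^A-Za-z0-9]+", "-", branch).strip("-").lower()
    return f"{base_name}-{slug}"


branch_executor_name = qualified_name("ingest-executor", "feature/new-parser")
```

The same function would be applied to every executor, location, and dataset name defined in the branch stack, so that each name is unique per branch.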

New environment deployment failures

These failures often happen because an AWS account limit has been hit. VPC and NAT Gateway limits are the usual culprits.

The problem with identifying the issue is that these errors don't bubble up to the CDK deployment process. Furthermore, a brand new environment stack that fails will be automatically rolled back and deleted. So, to find the reason, you need to filter CloudFormation's list of stacks by deleted, and find the most recently deleted Depot environment stack (which is likely to be your failed one).

Here is a link to your current account's deleted CloudFormation stacks in the eu-west-1 region.

Look for the environment stack named sdp-{your-env-id}. If you don't know what your new environment ID was, check your Depot Project CDK stack's Resources tab, and find it there with the type Custom::Environment.
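The same lookup can be scripted. The sketch below assumes boto3 is installed and credentials are configured for the account. The helper names are hypothetical, while list_stacks, list_stack_resources, and the dictionary keys are those of the CloudFormation API. Keep the StackId of the result: a deleted stack can only be queried by its unique stack ID, not by name.

```python
from typing import Iterable, Iterator, List, Optional


def latest_deleted_stack(summaries: Iterable[dict], name_prefix: str = "sdp-") -> Optional[dict]:
    """Pick the most recently deleted stack whose name starts with name_prefix."""
    candidates = [
        s for s in summaries
        if s.get("StackStatus") == "DELETE_COMPLETE"
        and s.get("StackName", "").startswith(name_prefix)
        and s.get("DeletionTime") is not None
    ]
    return max(candidates, key=lambda s: s["DeletionTime"], default=None)


def environment_resources(resources: Iterable[dict]) -> List[dict]:
    """Keep the resources of type Custom::Environment from a stack's resource list."""
    return [r for r in resources if r.get("ResourceType") == "Custom::Environment"]


def fetch_deleted_stacks(cloudformation_client) -> Iterator[dict]:
    """Yield summaries of deleted stacks using a boto3 CloudFormation client."""
    paginator = cloudformation_client.get_paginator("list_stacks")
    for page in paginator.paginate(StackStatusFilter=["DELETE_COMPLETE"]):
        yield from page["StackSummaries"]
```

With a client created as boto3.client("cloudformation", region_name="eu-west-1"), latest_deleted_stack(fetch_deleted_stacks(client)) would return the summary of the most recently deleted sdp- stack, or None if there is none.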

Highlight the deleted sdp-{your-env-id} stack, and click Events. Scroll down the list of events until you find the first failure. The reason given on that event usually makes clear what went wrong.
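Finding the first failure can also be done programmatically. This is a minimal sketch that works on event dictionaries shaped like those returned by the CloudFormation describe_stack_events API (Timestamp, ResourceStatus, ResourceStatusReason, LogicalResourceId). The helper names are hypothetical, and boto3 is assumed for the fetch function.

```python
from typing import Iterable, Iterator, Optional


def first_failure(events: Iterable[dict]) -> Optional[dict]:
    """Return the earliest event whose status ends in _FAILED, or None."""
    failed = [e for e in events if e.get("ResourceStatus", "").endswith("_FAILED")]
    # Sort by timestamp explicitly instead of relying on the order of the input.
    return min(failed, key=lambda e: e["Timestamp"], default=None)


def fetch_stack_events(cloudformation_client, stack_id: str) -> Iterator[dict]:
    """Yield all events of a stack; pass the stack ID (ARN) for a deleted stack."""
    paginator = cloudformation_client.get_paginator("describe_stack_events")
    for page in paginator.paginate(StackName=stack_id):
        yield from page["StackEvents"]
```

The ResourceStatusReason of the returned event is the text to read. Later failures are often just consequences of the first one, for example resources cancelled during rollback.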

Here is an example of a VPC limit being reached when a new environment was created:

[Screenshot: VPC Limit hit]