Backup and Restore

Overview and configuration

Depot supports data backup via features of the primary storage, allowing cost-effective point-in-time backup and restore. Backups can be written to a different AWS account (to recover from human error or malicious behaviour) and/or a different region (to protect against regional AWS failures). This guide shows how to enable backups and how to run the restore process.

DynamoDB

Supporting backups

To use the backup feature, your DynamoDB location must enable point-in-time recovery as follows:

const ddbLocation = Location.DynamoDB(stack, "MyDDBLocation", {
  name: "my-ddb-location",
  environment: myDepotEnv,
  pointInTimeRecovery: true
});

You also need to add an EMR Serverless Depot Executor to your environment CDK stack.

new Executor.EmrServerless(this, "EmrServerlessExecutor", {
  environment: myDepotEnv,
  name: "emr-serverless-executor"
});

Aurora

Backups are enabled by default and are retained for 7 days only. Backups are taken once daily, during a pre-defined backup window; no finer granularity is currently available.

Be aware of

warning

Backups are not currently supported for environments that use Snowflake as a primary storage location.

warning

Currently, to perform cross-account or cross-region backups, AWS Backup must be used to manually configure a schedule targeting the sdp-<env id>-Object, sdp-<env id>-History and sdp-<env id>-Index tables. Scheduling these tasks will be exposed through CDK in a later release; for manual configuration, see: https://eu-west-1.console.aws.amazon.com/backup/home

Create Backup Bucket

You will need an S3 bucket to back up to. The same bucket can be used for multiple backup locations by specifying a backup path prefix. The main requirements are:

  • Versioning must be enabled
  • If the backup bucket is in a different AWS account to the Depot environment, you must provide cross-account access using a bucket policy - for example:
{
  "Version": "2012-10-17",
  "Id": "Policy1629207315171",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::{DEPOT-ACCOUNT-ID}:root"
      },
      "Action": [
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:GetBucketVersioning"
      ],
      "Resource": "arn:aws:s3:::{BACK-UP-BUCKET}"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::{DEPOT-ACCOUNT-ID}:root"
      },
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts",
        "s3:DeleteObject",
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::{BACK-UP-BUCKET}/*"
    }
  ]
}
tip

For a production environment, it is recommended that this bucket is created with S3 Object Lock enabled, to prevent the backup data from being accidentally deleted.
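
As an illustration of the cross-account bucket policy above, the following sketch fills in the two placeholders programmatically. The function name buildBackupBucketPolicy and its types are hypothetical and not part of Depot or the AWS SDK; the actions and ARN shapes are copied from the example policy.

```typescript
// Hypothetical helper (not part of Depot): builds the cross-account
// bucket policy from the example above for a given account and bucket.
interface PolicyStatement {
  Effect: "Allow";
  Principal: { AWS: string };
  Action: string[];
  Resource: string;
}

interface BucketPolicy {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

function buildBackupBucketPolicy(depotAccountId: string, backupBucket: string): BucketPolicy {
  // AWS account ids are 12-digit numbers.
  if (!/^\d{12}$/.test(depotAccountId)) {
    throw new Error(`Expected a 12-digit AWS account id, got "${depotAccountId}"`);
  }
  const principal = { AWS: `arn:aws:iam::${depotAccountId}:root` };
  return {
    Version: "2012-10-17",
    Statement: [
      {
        // Bucket-level actions target the bucket ARN itself.
        Effect: "Allow",
        Principal: principal,
        Action: ["s3:ListBucket", "s3:ListBucketMultipartUploads", "s3:GetBucketVersioning"],
        Resource: `arn:aws:s3:::${backupBucket}`,
      },
      {
        // Object-level actions target every key in the bucket.
        Effect: "Allow",
        Principal: principal,
        Action: [
          "s3:AbortMultipartUpload",
          "s3:ListMultipartUploadParts",
          "s3:DeleteObject",
          "s3:GetObject",
          "s3:PutObject",
        ],
        Resource: `arn:aws:s3:::${backupBucket}/*`,
      },
    ],
  };
}

// Print the policy as JSON, ready to paste into the bucket's policy editor.
console.log(JSON.stringify(buildBackupBucketPolicy("123456789012", "my-backup-bucket"), null, 2));
```

The account id and bucket name in the last line are placeholder values.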

Data role access

The environment's Data IAM Role will need to be provided with write access to the S3 bucket that you plan on backing up to. Use the allowedBackupLocations property on the EmrServerless Executor to have Depot automatically add this permission to the data role.

new Executor.EmrServerless(this, "EmrServerlessExecutor", {
  environment: myDepotEnv,
  name: "emr-serverless-executor",
  allowedBackupLocations: ["s3://example-bucket/prefix"]
});

Alternatively, you can add the access yourself by manually adding a new IAM Policy to the IAM role named sdp-{environmentIdPrefix}{environmentId}-data-role:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Example1",
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::example-bucket/*"
    },
    {
      "Sid": "Example2",
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::example-bucket"
    }
  ]
}

Keep in mind that if you later try to delete the environment stack, the deletion could fail because of this manually added IAM Policy, which is effectively 'unmanaged' from CloudFormation's viewpoint. To avoid this, remove the policy before deleting the stack.
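
The two Resource ARNs in the manual policy can be derived from the same s3:// location string that would otherwise be passed to allowedBackupLocations. The sketch below shows one way to do that; parseS3Location and dataRoleResources are hypothetical helper names, and the bucket-wide scope simply mirrors the example policy.

```typescript
// Hypothetical helpers (not part of Depot): derive the policy ARNs
// from an s3:// backup location string.
interface S3Location {
  bucket: string;
  prefix: string;
}

function parseS3Location(location: string): S3Location {
  // s3://<bucket>[/<prefix>]
  const match = /^s3:\/\/([^/]+)(?:\/(.*))?$/.exec(location);
  if (!match) {
    throw new Error(`Not an S3 location: ${location}`);
  }
  return { bucket: match[1], prefix: match[2] ?? "" };
}

function dataRoleResources(location: string): { objects: string; bucket: string } {
  const { bucket } = parseS3Location(location);
  return {
    objects: `arn:aws:s3:::${bucket}/*`, // Resource for the object-level statement
    bucket: `arn:aws:s3:::${bucket}`,    // Resource for the s3:ListBucket statement
  };
}

console.log(dataRoleResources("s3://example-bucket/prefix"));
```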

Backup

The backup process creates a point-in-time snapshot of the primary storage to the backup location.

Simplified Backup through the CLI

An environment can be backed up using the Depot CLI. If the environment has a single object dataset, simply run the command:

depot backup-dataset --datasetId= --backupLocation= --pointInTime=

Where…

  • datasetId is the id of the dataset to back up
  • backupLocation is the S3 output location of the backup
  • pointInTime is the date and time at which the data should be backed up (omit to use the most recent date)

For example:

depot backup-dataset --datasetId=7c80cad5c403 --backupLocation=s3://my-backup-bucket/backupPrefix --pointInTime=2021-01-01T00:01:00Z
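
When backups are triggered from a script, the argument list can be assembled in code. The following is a minimal sketch that uses only the three flags documented above; the backupDatasetArgs function and BackupRequest type are hypothetical, and trimming the timestamp to second precision is an assumption made to match the example.

```typescript
// Hypothetical helper (not part of the Depot CLI): builds the argument
// list for `depot backup-dataset` from the documented flags.
interface BackupRequest {
  datasetId: string;
  backupLocation: string;
  pointInTime?: Date; // omitted: the CLI uses the most recent date
}

function backupDatasetArgs(req: BackupRequest): string[] {
  if (!req.backupLocation.startsWith("s3://")) {
    throw new Error(`backupLocation must be an S3 location: ${req.backupLocation}`);
  }
  const args = [
    "backup-dataset",
    `--datasetId=${req.datasetId}`,
    `--backupLocation=${req.backupLocation}`,
  ];
  if (req.pointInTime !== undefined) {
    // toISOString() always emits milliseconds; drop them to match the example format.
    const stamp = req.pointInTime.toISOString().replace(/\.\d{3}Z$/, "Z");
    args.push(`--pointInTime=${stamp}`);
  }
  return args;
}

console.log(
  ["depot", ...backupDatasetArgs({
    datasetId: "7c80cad5c403",
    backupLocation: "s3://my-backup-bucket/backupPrefix",
    pointInTime: new Date("2021-01-01T00:01:00Z"),
  })].join(" ")
);
```

Returning an array rather than a single string lets the arguments be passed to a process-spawning API without shell quoting concerns.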

The backup process is a step function that performs the following actions:

  • Pause writes to the primary (authoritative) storage location (waiting for current writes to complete)
  • Back up the dataset(s) to the backupLocation
  • Resume writes

You can track progress of your backup job using the following command:

depot backup-status

Backup multiple datasets

To back up multiple datasets, use the following CLI command:

depot backup-dataset --datasets=backup-datasets.json

where the provided JSON mapping file (e.g. backup-datasets.json) contains an array of dataset backup objects, each with the fields datasetId, backupLocation and pointInTime. The definitions of these fields correspond to the single-dataset CLI invocation described earlier. For example:

[
  {
    "datasetId": "eb5b4f04d32a",
    "backupLocation": "s3://example-sdp-s3-backup-test/ws-crd-demo-noDataPrefix",
    "pointInTime": "2021-08-30T14:14:17.530Z"
  },
  {
    "datasetId": "282fc9d58b16",
    "backupLocation": "s3://example-sdp-s3-backup-test/ws-crd-demo",
    "pointInTime": "2021-08-30T14:14:17.530Z"
  }
]
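
It can be useful to sanity-check a mapping file before submitting it. The sketch below validates the three fields described above; the BackupDatasetEntry type and validateBackupEntries function are hypothetical, and treating pointInTime as optional is an assumption carried over from the single-dataset command.

```typescript
// Hypothetical pre-flight check (not part of the Depot CLI) for the
// entries of a backup mapping file such as backup-datasets.json.
interface BackupDatasetEntry {
  datasetId: string;
  backupLocation: string;
  pointInTime?: string; // assumed optional, as in the single-dataset command
}

// UTC timestamps in the shape used by the examples, with optional fractional seconds.
const ISO_UTC_TIMESTAMP = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$/;

function validateBackupEntries(entries: BackupDatasetEntry[]): string[] {
  const problems: string[] = [];
  entries.forEach((entry, index) => {
    if (entry.datasetId.trim() === "") {
      problems.push(`entry ${index}: datasetId is empty`);
    }
    if (!entry.backupLocation.startsWith("s3://")) {
      problems.push(`entry ${index}: backupLocation is not an S3 location`);
    }
    if (
      entry.pointInTime !== undefined &&
      (!ISO_UTC_TIMESTAMP.test(entry.pointInTime) || Number.isNaN(Date.parse(entry.pointInTime)))
    ) {
      problems.push(`entry ${index}: pointInTime is not a UTC ISO 8601 timestamp`);
    }
  });
  return problems;
}

const sampleEntries: BackupDatasetEntry[] = [
  {
    datasetId: "eb5b4f04d32a",
    backupLocation: "s3://example-sdp-s3-backup-test/ws-crd-demo-noDataPrefix",
    pointInTime: "2021-08-30T14:14:17.530Z",
  },
  {
    datasetId: "282fc9d58b16",
    backupLocation: "s3://example-sdp-s3-backup-test/ws-crd-demo",
    pointInTime: "2021-08-30T14:14:17.530Z",
  },
];

// An empty list means no problems were found.
console.log(validateBackupEntries(sampleEntries));
```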

Restore

The restore process takes a previously backed up storage location in S3 and copies the data to any/all of:

  • a new store in the same datasets of the same environment
  • different datasets in the same environment
  • different datasets in a new environment

Simplified Restore through the CLI

An environment can be restored using the Depot CLI. If the environment has a single object dataset, simply run the command:

depot restore-dataset --datasetId= --sourceDatasetId= --newId= --backupLocation=

Where…

  • datasetId is the id of the dataset where data will be written to
  • sourceDatasetId is the id of the dataset in the backup location, which may differ from datasetId when restoring to a different environment or to a different dataset in the same environment. It can be omitted if it is the same as datasetId
  • newId is the id to assign to the new store (this is arbitrary but must be unique). Omit to use auto-generated store id based on current time, e.g. 20210901T120000Z.
  • backupLocation is the S3 location of the back-up including the relevant data path. This should be the path to previously backed up data.

For example:

depot restore-dataset --datasetId=7c80cad5c403 --backupLocation=s3://my-backup-bucket/backupPrefix
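
If you prefer to pass --newId explicitly while keeping the timestamp style of the auto-generated ids (e.g. 20210901T120000Z), an id can be derived from a date as sketched below. The defaultStoreId function is hypothetical; that Depot uses UTC and exactly this formatting internally is an assumption based only on the example id.

```typescript
// Hypothetical helper (not part of Depot): formats a date in the compact
// style of the example store id 20210901T120000Z. UTC is assumed.
function defaultStoreId(now: Date): string {
  return now
    .toISOString()               // e.g. 2021-09-01T12:00:00.000Z
    .replace(/\.\d{3}Z$/, "Z")   // drop milliseconds
    .replace(/[-:]/g, "");       // drop date and time separators
}

console.log(defaultStoreId(new Date()));
```

Because ids built this way have second resolution, two restores started within the same second would collide; the store id must be unique.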

The restore process is a step function that performs the following actions:

  • Pause writes to the primary (authoritative) storage location (waiting for current writes to complete)
  • Create a new, empty data store using the current primary storage location
  • Populate the new store with data from the back-up location
  • Set the new store as the active store for the primary storage location and resume writes
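
The ordering of the steps above can be sketched as follows. This is not the step function's implementation: the RestoreOps interface and every method on it are hypothetical stand-ins, the real process is asynchronous, and resuming writes when a step fails is a design choice of this sketch rather than documented behaviour.

```typescript
// Hypothetical operations; none of these names come from the Depot API.
interface RestoreOps {
  pauseWrites(): void;
  createStore(storeId: string): void;
  populateStore(storeId: string, backupLocation: string): void;
  activateStore(storeId: string): void;
  resumeWrites(): void;
}

function runRestore(ops: RestoreOps, storeId: string, backupLocation: string): void {
  ops.pauseWrites();                            // 1. pause writes to the primary storage location
  try {
    ops.createStore(storeId);                   // 2. create a new, empty store
    ops.populateStore(storeId, backupLocation); // 3. load data from the backup location
    ops.activateStore(storeId);                 // 4. make the new store the active one
  } finally {
    // Assumption of this sketch: writes are resumed even if a step fails,
    // in which case the previously active store is still in use.
    ops.resumeWrites();
  }
}

// Demo with operations that only record what was called.
const demoLog: string[] = [];
const demoOps: RestoreOps = {
  pauseWrites: () => { demoLog.push("pause"); },
  createStore: (id) => { demoLog.push(`create ${id}`); },
  populateStore: (id, from) => { demoLog.push(`populate ${id} from ${from}`); },
  activateStore: (id) => { demoLog.push(`activate ${id}`); },
  resumeWrites: () => { demoLog.push("resume"); },
};
runRestore(demoOps, "restore01", "s3://my-backup-bucket/backupPrefix");
console.log(demoLog.join("\n"));
```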

You can track progress of your restore job using the following command:

depot restore-status

Restore to different dataset id

When restoring to a different environment than the source of the backup, the environment and dataset ids will differ from the original ones. To account for this, include the old dataset id as sourceDatasetId:

depot restore-dataset --datasetId=eb5b4f04d32a --sourceDatasetId=7c80cad5c403 --backupLocation=s3://my-backup-bucket/backupPrefix

The backup location provided should always be the 'root' of the backup. If in doubt, this is the same location that was passed to the backup process.

Restore multiple datasets

If an environment contains multiple datasets, all of them should be restored together or those left out will not be available in the new Store. To restore multiple datasets, use the following CLI command:

depot restore-dataset --datasets=restore-datasets.json

where the provided JSON mapping file (e.g. restore-datasets.json) contains an array of dataset restore objects, each with the fields datasetId, backupLocation and sourceDatasetId. The definitions of these fields correspond to the single-dataset CLI invocation described earlier. For example:

[
  {
    "datasetId": "eb5b4f04d32a",
    "backupLocation": "s3://example-sdp-s3-backup-test/ws-crd-demo-noDataPrefix"
  },
  {
    "datasetId": "2c2696d1c8d8",
    "sourceDatasetId": "282fc9d58b16",
    "backupLocation": "s3://example-sdp-s3-backup-test/ws-crd-demo"
  }
]
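
A restore mapping can be derived from the entries of the backup mapping that produced the backup. The sketch below does this, adding sourceDatasetId only where the target dataset id differs from the backed-up one; the toRestoreEntries function, its types, and the targetIds lookup are hypothetical.

```typescript
// Hypothetical helper (not part of the Depot CLI): derives restore mapping
// entries from backup mapping entries.
interface BackedUpDataset {
  datasetId: string;
  backupLocation: string;
}

interface RestoreDatasetEntry {
  datasetId: string;
  backupLocation: string;
  sourceDatasetId?: string;
}

// targetIds maps a backed-up dataset id to the dataset id to restore into.
// Datasets missing from the map are restored under their original id.
function toRestoreEntries(
  backups: BackedUpDataset[],
  targetIds: Record<string, string>
): RestoreDatasetEntry[] {
  return backups.map((backup) => {
    const target = targetIds[backup.datasetId] ?? backup.datasetId;
    const entry: RestoreDatasetEntry = {
      datasetId: target,
      backupLocation: backup.backupLocation,
    };
    if (target !== backup.datasetId) {
      // Only needed when the ids differ; otherwise it can be omitted.
      entry.sourceDatasetId = backup.datasetId;
    }
    return entry;
  });
}

const restoreEntries = toRestoreEntries(
  [
    { datasetId: "eb5b4f04d32a", backupLocation: "s3://example-sdp-s3-backup-test/ws-crd-demo-noDataPrefix" },
    { datasetId: "282fc9d58b16", backupLocation: "s3://example-sdp-s3-backup-test/ws-crd-demo" },
  ],
  { "282fc9d58b16": "2c2696d1c8d8" }
);
console.log(JSON.stringify(restoreEntries, null, 2));
```

Generating the file from the full backup mapping also helps ensure that no dataset is left out of the restore.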

Step-by-Step Restore Process

info

We recommend using the CLI commands shown in this section, but YAML representations of API commands are also provided.

This section provides a more detailed understanding of the restore process, as well as instructions on how to run each step manually.

Create a new store for the environment

The restore process starts by creating a new (empty) store, with the new user-supplied storeId. This is possible with the CLI command:

depot create-store --newId=abc123
id: abc123
schema: Store

The new store id becomes part of the path in S3 storage. For example, if multiple restores have taken place with store ids restore01 and restore02, the S3 location would look like:

  • s3://{bucket-name}/gold/{env-id}/{dataset-id}/... (an initial store id for consistency is planned for a future release)
  • s3://{bucket-name}/gold/{env-id}/restore01/{dataset-id}/...
  • s3://{bucket-name}/gold/{env-id}/restore02/{dataset-id}/...
note

The structure with additional store ids will be replicated by the back-up functionality as well. When restoring from a back-up of an environment that had previously been restored (and therefore has multiple stores), the back-up path should include the appropriate store id - e.g. s3://{bucket-name}/gold/{env-id}/restore02/.
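
The path layout above can be expressed as a small function, which may help when working out which backup path to supply for a previously restored environment. The storeDataPath function is a hypothetical helper; the gold segment and the ordering of the segments are taken from the paths listed above.

```typescript
// Hypothetical helper (not part of Depot): builds the S3 data path for a
// dataset, following the layout listed above.
function storeDataPath(bucket: string, envId: string, datasetId: string, storeId?: string): string {
  const segments = ["gold", envId];
  if (storeId !== undefined) {
    // Stores created by a restore add their id as an extra path segment.
    segments.push(storeId);
  }
  segments.push(datasetId);
  return `s3://${bucket}/${segments.join("/")}/`;
}

console.log(storeDataPath("my-bucket", "env123", "7c80cad5c403"));
console.log(storeDataPath("my-bucket", "env123", "7c80cad5c403", "restore02"));
```

The bucket, environment and dataset ids in the two calls are placeholder values.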

Pause data updates to the dataset

A dataset can be in one of the following states: running, pausing or paused.

To prevent updates to a dataset while it’s being restored, the process first pauses updates to the dataset. This can be achieved using the following CLI command:

depot patch-dataset --id=abc123 --data='{"status": "PAUSING"}'
id: abc123
schema: Dataset
status: PAUSING

Data can still be read from the dataset while the dataset is paused.

The dataset will enter a pausing state until all in-progress writes have completed (after sufficient time has passed since last write), at which point the status will change to paused. Use the list-dataset command to check progress.

Behaviour while pausing/paused

REST and GraphQL API

Any clients attempting to update a dataset through the REST or GraphQL APIs will receive a 503 Service Unavailable error response.
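
The behaviour described so far (reads are still served, updates are rejected with 503) can be summarised as a small decision function. This is only an illustration: the rejectionStatus function and its types are hypothetical, and the upper-case PAUSED value is assumed by analogy with the PAUSING and RUNNING values used in the CLI commands.

```typescript
// Hypothetical model (not Depot code) of how API requests are treated
// depending on the dataset status.
type DatasetStatus = "RUNNING" | "PAUSING" | "PAUSED"; // PAUSED spelling is assumed
type Operation = "read" | "write";

// Returns 503 when the request would be rejected, undefined when it is allowed.
function rejectionStatus(status: DatasetStatus, operation: Operation): 503 | undefined {
  if (operation === "write" && status !== "RUNNING") {
    return 503; // Service Unavailable: updates are blocked while pausing or paused
  }
  return undefined; // reads are always served; writes only while running
}

console.log(rejectionStatus("PAUSING", "write"));
console.log(rejectionStatus("PAUSED", "read"));
```

Clients that write to a dataset may want to treat a 503 during a restore as retryable, since updates are re-enabled once the dataset is running again.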

Data Transactions

All in-progress data transactions will run to completion to ensure data is consistent, but no new transactions are processed. New transactions will remain queued and will be processed once the environment has been restarted.

Load data into the new store from the backup location

Directly calling the process that restores data from a backupLocation is no longer recommended, because its interface is not stable. This section is retained because it describes a step of the overall process, and a description of the internal step may be reinstated here in future.

Switch to the new datastore

The environment can now be switched to use the newly migrated store using the CLI or PatchEnvironment API method:

depot patch-environment --data='{"store": {"id": "store02"}}'
id: env123
schema: Environment
store:
  id: store02

All requests to the environment will now use the new store.

note

Remember to restore all object datasets before switching the store, because all of them will use the new store. If a dataset does not have data in the new store, it will appear empty.

Resume data updates to the dataset

Re-enable updates to a dataset after switching the store by running the following CLI command:

depot patch-dataset --id=abc123 --data='{"status": "RUNNING"}'
id: abc123
schema: Dataset
status: RUNNING

You can now make read and write requests to the new store through the API.

At this point, data is being written to the new store's partition in all storage locations, and the restore is complete.

Disaster recovery

Whilst the old mechanism of constantly streaming data to S3 is in theory better suited to a true disaster recovery situation, since any point in time can be recovered without first creating a backup, in practice it is inferior to the point-in-time recovery (PITR) mechanism that DynamoDB provides natively. In particular, it is very difficult to be sure that all data has been reliably replicated to S3, since a bug introduced in our handling Lambda could render the backup data invalid.

The proposed structure for a disaster-resilient deployment is as follows:

The delivery of this configuration as an option will be in a later release.