Backup and Restore
Overview and configuration
Depot supports data backup via features of the primary storage, allowing cost-effective point-in-time backup and restore. Backups can be written across different AWS accounts (to recover from human error or malicious behaviour) and/or regions (to protect against regional AWS failures). This guide shows you how to enable backups and how to run the restore process.
DynamoDB
To use the backup feature, your DynamoDB location must enable point-in-time recovery as follows:
const ddbLocation = Location.DynamoDB(stack, "MyDDBLocation", {
name: "my-ddb-location",
environment: myDepotEnv,
pointInTimeRecovery: true
});
You also need to add an EMR Serverless Depot Executor to your environment CDK stack.
new Executor.EmrServerless(this, "EmrServerlessExecutor", {
environment: myDepotEnv,
name: "emr-serverless-executor"
});
Aurora
Backups are enabled by default and are retained for 7 days only. Backups are taken once daily, during a pre-defined backup window. Finer granularity is not currently available.
Be aware of
Backups are not currently supported for environments that use Snowflake as a primary storage location.
Currently, to perform cross-account or cross-region backups, AWS Backup must be used to manually configure a schedule targeting the
sdp-<env id>-Object, sdp-<env id>-History and sdp-<env id>-Index tables. Scheduling these tasks will be exposed through
CDK in a later release. For manual configuration, see: https://eu-west-1.console.aws.amazon.com/backup/home
Create Backup Bucket
You will need an S3 bucket to back up to. The same bucket can be used for multiple backup locations by specifying a backup path prefix. The main requirements are:
- Versioning must be enabled
- If the backup bucket is in a different AWS account to the Depot environment, you must provide cross-account access using a bucket policy, for example:
{
"Version": "2012-10-17",
"Id": "Policy1629207315171",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::{DEPOT-ACCOUNT-ID}:root"
},
"Action": [
"s3:ListBucket",
"s3:ListBucketMultipartUploads",
"s3:GetBucketVersioning"
],
"Resource": "arn:aws:s3:::{BACK-UP-BUCKET}"
},
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::{DEPOT-ACCOUNT-ID}:root"
},
"Action": [
"s3:AbortMultipartUpload",
"s3:ListMultipartUploadParts",
"s3:DeleteObject",
"s3:GetObject",
"s3:PutObject"
],
"Resource": "arn:aws:s3:::{BACK-UP-BUCKET}/*"
}
]
}
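If you manage several backup buckets, the policy above can be generated rather than edited by hand. The following is a minimal sketch only: `buildBackupBucketPolicy` is a hypothetical helper (not part of Depot or the AWS SDK) that fills the two placeholders, {DEPOT-ACCOUNT-ID} and {BACK-UP-BUCKET}, with the same actions as the example policy.

```typescript
// Hypothetical helper, not part of Depot or the AWS SDK.
// It reproduces the two statements of the example bucket policy above.
interface PolicyStatement {
  Effect: "Allow";
  Principal: { AWS: string };
  Action: string[];
  Resource: string;
}

interface BucketPolicy {
  Version: "2012-10-17";
  Statement: PolicyStatement[];
}

function buildBackupBucketPolicy(depotAccountId: string, bucketName: string): BucketPolicy {
  // AWS account ids are always 12 digits; reject anything else early.
  if (!/^\d{12}$/.test(depotAccountId)) {
    throw new Error(`Expected a 12-digit AWS account id, got "${depotAccountId}"`);
  }
  const principal = { AWS: `arn:aws:iam::${depotAccountId}:root` };
  return {
    Version: "2012-10-17",
    Statement: [
      {
        // Bucket-level actions apply to the bucket ARN itself.
        Effect: "Allow",
        Principal: principal,
        Action: ["s3:ListBucket", "s3:ListBucketMultipartUploads", "s3:GetBucketVersioning"],
        Resource: `arn:aws:s3:::${bucketName}`,
      },
      {
        // Object-level actions apply to every key in the bucket.
        Effect: "Allow",
        Principal: principal,
        Action: [
          "s3:AbortMultipartUpload",
          "s3:ListMultipartUploadParts",
          "s3:DeleteObject",
          "s3:GetObject",
          "s3:PutObject",
        ],
        Resource: `arn:aws:s3:::${bucketName}/*`,
      },
    ],
  };
}

// Example: render the policy as JSON ready to paste into the bucket's permissions.
const exampleBucketPolicyJson = JSON.stringify(
  buildBackupBucketPolicy("123456789012", "my-backup-bucket"),
  null,
  2
);
```

The account id and bucket name in the usage line are placeholders.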
For a production environment, we recommend creating this bucket with S3 Object Lock enabled, to prevent the backup data from being accidentally deleted.
Data role access
The environment's Data IAM Role will need to be provided with write access to the S3 bucket that you plan on backing up to. Use the allowedBackupLocations
property on the EmrServerless Executor to have Depot automatically add this permission to the data role.
new Executor.EmrServerless(this, "EmrServerlessExecutor", {
environment: myDepotEnv,
name: "emr-serverless-executor",
allowedBackupLocations: ["s3://example-bucket/prefix"]
});
Alternatively, you can add the access yourself by manually adding a new IAM Policy to the IAM role named sdp-{environmentIdPrefix}{environmentId}-data-role:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Example1",
"Effect": "Allow",
"Action": "s3:*",
"Resource": "arn:aws:s3:::example-bucket/*"
},
{
"Sid": "Example2",
"Effect": "Allow",
"Action": "s3:ListBucket",
"Resource": "arn:aws:s3:::example-bucket"
}
]
}
Just keep in mind that if you try to delete the environment stack at any point in the future, the deletion could fail because of this manually added IAM Policy. To avoid this you will need to remove the policy first. This happens because the IAM Policy is effectively 'unmanaged' from CloudFormation's viewpoint.
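If you take the manual route, the role name and policy can be derived from the backup location. The sketch below is illustrative only: `parseS3Uri`, `dataRoleName` and `dataRolePolicy` are hypothetical helpers, and the policy mirrors the example above (including its broad s3:* action), which you may want to narrow.

```typescript
// Hypothetical helpers, not part of Depot.
interface S3Location {
  bucket: string;
  prefix: string;
}

// Split "s3://bucket/some/prefix" into its bucket and prefix parts.
function parseS3Uri(uri: string): S3Location {
  const scheme = "s3://";
  if (!uri.startsWith(scheme)) {
    throw new Error(`Not an S3 URI: ${uri}`);
  }
  const rest = uri.slice(scheme.length);
  const slash = rest.indexOf("/");
  const bucket = slash === -1 ? rest : rest.slice(0, slash);
  const prefix = slash === -1 ? "" : rest.slice(slash + 1);
  if (bucket === "") {
    throw new Error(`Missing bucket name in: ${uri}`);
  }
  return { bucket, prefix };
}

// Follows the naming pattern sdp-{environmentIdPrefix}{environmentId}-data-role.
function dataRoleName(environmentIdPrefix: string, environmentId: string): string {
  return `sdp-${environmentIdPrefix}${environmentId}-data-role`;
}

interface DataRoleStatement {
  Sid: string;
  Effect: "Allow";
  Action: string;
  Resource: string;
}

// Builds the same two statements as the example policy above.
function dataRolePolicy(backupLocation: string): { Version: string; Statement: DataRoleStatement[] } {
  const { bucket } = parseS3Uri(backupLocation);
  return {
    Version: "2012-10-17",
    Statement: [
      { Sid: "Example1", Effect: "Allow", Action: "s3:*", Resource: `arn:aws:s3:::${bucket}/*` },
      { Sid: "Example2", Effect: "Allow", Action: "s3:ListBucket", Resource: `arn:aws:s3:::${bucket}` },
    ],
  };
}
```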
Backup
The backup process creates a point-in-time snapshot of the primary storage to the backup location.
Simplified Backup through the CLI
An environment can be backed up using the Depot CLI. If the environment has a single object dataset, run the command:
depot backup-dataset --datasetId= --backupLocation= --pointInTime=
Where…
- datasetId is the id of the dataset to back up
- backupLocation is the S3 output location of the back-up
- pointInTime is the date and time that the data should be backed up at (omit to use the most recent date)
For example:
depot backup-dataset --datasetId=7c80cad5c403 --backupLocation=s3://my-backup-bucket/backupPrefix --pointInTime=2021-01-01T00:01:00Z
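When backups are triggered from a script, it can help to assemble the arguments in one place. The sketch below is an assumption-laden illustration: `buildBackupArgs` is a hypothetical helper that only uses the three flags documented above, and it assumes the CLI accepts the millisecond-precision timestamps produced by `Date.prototype.toISOString()`.

```typescript
// Hypothetical helper, not part of the Depot CLI.
interface BackupRequest {
  datasetId: string;
  backupLocation: string;
  // Omit to back up the most recent data.
  pointInTime?: Date;
}

function buildBackupArgs(req: BackupRequest): string[] {
  if (!req.backupLocation.startsWith("s3://")) {
    throw new Error(`backupLocation must be an S3 URI, got "${req.backupLocation}"`);
  }
  const args = [
    "backup-dataset",
    `--datasetId=${req.datasetId}`,
    `--backupLocation=${req.backupLocation}`,
  ];
  if (req.pointInTime !== undefined) {
    // toISOString() always renders UTC, e.g. 2021-01-01T00:01:00.000Z
    // (assumed to be accepted by --pointInTime).
    args.push(`--pointInTime=${req.pointInTime.toISOString()}`);
  }
  return args;
}

// Example: the arguments to pass after the "depot" executable.
const exampleBackupArgs = buildBackupArgs({
  datasetId: "7c80cad5c403",
  backupLocation: "s3://my-backup-bucket/backupPrefix",
  pointInTime: new Date("2021-01-01T00:01:00Z"),
});
```

Returning an array rather than a single string avoids shell-quoting issues when the arguments are passed to a process-spawning API.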
The backup process is a step function that performs the following actions:
- Pause writes to the primary (authoritative) storage location (waiting for current writes to complete)
- Backup the dataset(s) to the backupLocation
- Resume writes
You can track progress of your backup job using the following command:
depot backup-status
Backup multiple datasets
To back up multiple datasets, use the following CLI command:
depot backup-dataset --datasets=backup-datasets.json
where the provided JSON mapping file (e.g. backup-datasets.json) contains an array of dataset backup objects with the fields datasetId, backupLocation and pointInTime.
The definitions of these fields correspond to the single-dataset CLI invocation described earlier. For example:
[
{
"datasetId": "eb5b4f04d32a",
"backupLocation": "s3://example-sdp-s3-backup-test/ws-crd-demo-noDataPrefix",
"pointInTime": "2021-08-30T14:14:17.530Z"
},
{
"datasetId": "282fc9d58b16",
"backupLocation": "s3://example-sdp-s3-backup-test/ws-crd-demo",
"pointInTime": "2021-08-30T14:14:17.530Z"
}
]
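A mapping file can be checked before it is handed to the CLI. The following sketch is not Depot's own validation: `validateBackupMapping` is a hypothetical helper, and it assumes pointInTime is optional in the file, as it is on the command line.

```typescript
// Hypothetical pre-flight check, not part of the Depot CLI.
interface BackupMappingEntry {
  datasetId: string;
  backupLocation: string;
  pointInTime?: string;
}

// Returns a list of human-readable problems; an empty list means the file looks valid.
function validateBackupMapping(jsonText: string): string[] {
  const parsed: unknown = JSON.parse(jsonText);
  if (!Array.isArray(parsed)) {
    return ["mapping file must contain a JSON array"];
  }
  const problems: string[] = [];
  parsed.forEach((entry: unknown, index: number) => {
    const e = (entry ?? {}) as Partial<Record<keyof BackupMappingEntry, unknown>>;
    if (typeof e.datasetId !== "string" || e.datasetId === "") {
      problems.push(`entry ${index}: datasetId is required`);
    }
    if (typeof e.backupLocation !== "string" || !e.backupLocation.startsWith("s3://")) {
      problems.push(`entry ${index}: backupLocation must be an S3 URI`);
    }
    if (
      e.pointInTime !== undefined &&
      (typeof e.pointInTime !== "string" || Number.isNaN(Date.parse(e.pointInTime)))
    ) {
      problems.push(`entry ${index}: pointInTime is not a parseable timestamp`);
    }
  });
  return problems;
}
```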
Restore
The restore process takes a previously backed up storage location in S3 and copies the data to any/all of:
- a new store in the same datasets of the same environment
- different datasets in the same environment
- different datasets in a new environment
Simplified Restore through the CLI
An environment can be restored using the Depot CLI. If the environment has a single object dataset, run the command:
depot restore-dataset --datasetId= --sourceDatasetId= --newId= --backupLocation=
Where…
- datasetId is the id of the dataset where data will be written to
- sourceDatasetId is the id of the dataset in the backup location (which may be different if restoring to a different environment or to a different dataset in the same environment). It can be omitted if it is the same as datasetId.
- newId is the id to assign to the new store (this is arbitrary but must be unique). Omit it to use an auto-generated store id based on the current time, e.g. 20210901T120000Z.
- backupLocation is the S3 location of the back-up, including the relevant data path. This should be the path to previously backed up data.
For example:
depot restore-dataset --datasetId=7c80cad5c403 --backupLocation=s3://my-backup-bucket/backupPrefix
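The sketch below illustrates how the restore flags combine. Both functions are hypothetical helpers, not part of the Depot CLI. `storeIdFromTime` produces an id in the same shape as the documented example (20210901T120000Z); whether Depot generates its default id in exactly this way is an assumption. `buildRestoreArgs` drops sourceDatasetId when it matches datasetId, since the flag can be omitted in that case.

```typescript
// Hypothetical helpers, not part of the Depot CLI.
interface RestoreRequest {
  datasetId: string;
  backupLocation: string;
  sourceDatasetId?: string;
  newId?: string;
}

// Formats a timestamp as a compact UTC id, e.g. 20210901T120000Z.
function storeIdFromTime(now: Date): string {
  return now
    .toISOString() // 2021-09-01T12:00:00.000Z
    .replace(/\.\d{3}Z$/, "Z") // drop milliseconds
    .replace(/[-:]/g, ""); // drop date and time separators
}

function buildRestoreArgs(req: RestoreRequest): string[] {
  const args = ["restore-dataset", `--datasetId=${req.datasetId}`];
  // sourceDatasetId is only needed when the backup came from a different dataset.
  if (req.sourceDatasetId !== undefined && req.sourceDatasetId !== req.datasetId) {
    args.push(`--sourceDatasetId=${req.sourceDatasetId}`);
  }
  if (req.newId !== undefined) {
    args.push(`--newId=${req.newId}`);
  }
  args.push(`--backupLocation=${req.backupLocation}`);
  return args;
}

// Example: restore into the same dataset with an explicit, time-based store id.
const exampleRestoreArgs = buildRestoreArgs({
  datasetId: "7c80cad5c403",
  backupLocation: "s3://my-backup-bucket/backupPrefix",
  newId: storeIdFromTime(new Date("2021-09-01T12:00:00Z")),
});
```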
The restore process is a step function that performs the following actions:
- Pause writes to the primary (authoritative) storage location (waiting for current writes to complete)
- Create a new, empty data store using the current primary storage location
- Populate the new store with data from the back-up location
- Set the new store as the active store for the primary storage location and resume writes
You can track progress of your restore job using the following command:
depot restore-status
Restore to different dataset id
When restoring to a different environment than the source of the back-up, the environment and dataset ids will differ from the original ones.
To account for this, include the old dataset id as sourceDatasetId:
depot restore-dataset --datasetId=eb5b4f04d32a --sourceDatasetId=7c80cad5c403 --backupLocation=s3://my-backup-bucket/backupPrefix
The backup location provided should always be the 'root' of the backup. If in doubt, this is the same location that was passed to the backup process.
Restore multiple datasets
If an environment contains multiple datasets, all of them should be restored together or those left out will not be available in the new Store. To restore multiple datasets, use the following CLI command:
depot restore-dataset --datasets=restore-datasets.json
where the provided JSON mapping file (e.g. restore-datasets.json) contains an array of dataset restore objects with the fields datasetId, backupLocation and sourceDatasetId. The definitions of these fields correspond to the single-dataset CLI invocation described earlier. For example:
[
{
"datasetId": "eb5b4f04d32a",
"backupLocation": "s3://example-sdp-s3-backup-test/ws-crd-demo-noDataPrefix"
},
{
"datasetId": "2c2696d1c8d8",
"sourceDatasetId": "282fc9d58b16",
"backupLocation": "s3://example-sdp-s3-backup-test/ws-crd-demo"
}
]
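Because datasets left out of a restore will not be available in the new store, it is worth comparing the mapping against the environment's full dataset list before running the command. A minimal sketch, assuming you already have the list of dataset ids to hand (`datasetsMissingFromRestore` is a hypothetical helper, not a Depot API):

```typescript
// Hypothetical pre-flight check, not part of Depot.
interface RestoreMappingEntry {
  datasetId: string;
  backupLocation: string;
  sourceDatasetId?: string;
}

// Returns the ids of environment datasets that the mapping does not restore.
function datasetsMissingFromRestore(
  environmentDatasetIds: string[],
  mapping: RestoreMappingEntry[]
): string[] {
  const covered = new Set(mapping.map((entry) => entry.datasetId));
  return environmentDatasetIds.filter((id) => !covered.has(id));
}

// Example: the mapping from above, checked against an environment with three datasets.
const exampleMissing = datasetsMissingFromRestore(
  ["eb5b4f04d32a", "2c2696d1c8d8", "ffffffffffff"],
  [
    { datasetId: "eb5b4f04d32a", backupLocation: "s3://example-sdp-s3-backup-test/ws-crd-demo-noDataPrefix" },
    {
      datasetId: "2c2696d1c8d8",
      sourceDatasetId: "282fc9d58b16",
      backupLocation: "s3://example-sdp-s3-backup-test/ws-crd-demo",
    },
  ]
);
```

The third dataset id in the example is a made-up placeholder; a non-empty result means that dataset would appear empty after the switch to the new store.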
Step-by-Step Restore Process
We recommend using the CLI commands shown in this section, but YAML representations of API commands are also provided.
This section provides a more detailed understanding of the restore process, as well as instructions on how to run each step manually.
Create a new store for the environment
The restore process starts by creating a new (empty) store, with the new user-supplied storeId. This is possible with the CLI command:
depot create-store --newId=abc123
id: abc123
schema: Store
The new store id becomes part of the path in S3 storage. For example, if multiple restores have taken place with store ids restore01 and restore02, the S3 location would look like:
- s3://{bucket-name}/gold/{env-id}/{dataset-id}/... (having an initial store id for consistency is planned for a future release)
- s3://{bucket-name}/gold/{env-id}/restore01/{dataset-id}/...
- s3://{bucket-name}/gold/{env-id}/restore02/{dataset-id}/...
The structure with additional store ids will be replicated by the back-up functionality as well.
When restoring from a back-up of an environment that had previously been restored (and therefore has multiple stores),
the back-up path should include the appropriate store id - e.g. s3://{bucket-name}/gold/{env-id}/restore02/.
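The path layout described above can be expressed as a small function. This is a sketch for illustration only: `storeRootPath` is a hypothetical helper, and it simply follows the pattern shown, where the initial store has no store id segment.

```typescript
// Hypothetical helper, not part of Depot.
// Builds the S3 root for a store, following the layout described above.
function storeRootPath(bucketName: string, envId: string, storeId?: string): string {
  const segments = ["gold", envId];
  // The initial store currently has no store id segment in its path.
  if (storeId !== undefined) {
    segments.push(storeId);
  }
  return `s3://${bucketName}/${segments.join("/")}/`;
}

// Examples: the initial store and a store created by a later restore.
const initialStoreRoot = storeRootPath("my-bucket", "env123");
const restoredStoreRoot = storeRootPath("my-bucket", "env123", "restore02");
```

The bucket and environment ids in the usage lines are placeholders.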
Pause data updates to the dataset
A dataset can be in one of the following states: running, pausing or paused.
To prevent updates to a dataset while it is being restored, the process first pauses updates to the dataset. This can be achieved using the following CLI command:
patch-dataset --id=abc123 --data='{"status": "PAUSING"}'
id: abc123
schema: Dataset
status: PAUSING
Data can still be read from the dataset while the dataset is paused.
The dataset will enter a pausing state until all in-progress writes have completed (after sufficient time has passed since last write), at which point the status will change to paused. Use the list-dataset command to check progress.
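The status lifecycle can be pictured as a small state machine. The sketch below is a model for reasoning about the process, not Depot's implementation: the event names are invented for illustration, PAUSED is assumed to be the API value for the paused state (only PAUSING and RUNNING appear in the commands above), and resuming is modelled only from the paused state.

```typescript
// Illustrative model only; event names are hypothetical.
type DatasetStatus = "RUNNING" | "PAUSING" | "PAUSED";
type DatasetEvent = "PAUSE_REQUESTED" | "WRITES_DRAINED" | "RESUME_REQUESTED";

function nextDatasetStatus(current: DatasetStatus, event: DatasetEvent): DatasetStatus {
  // A pause request moves a running dataset into the transitional pausing state.
  if (current === "RUNNING" && event === "PAUSE_REQUESTED") return "PAUSING";
  // Once all in-progress writes have completed, the dataset becomes paused.
  if (current === "PAUSING" && event === "WRITES_DRAINED") return "PAUSED";
  // Resuming is modelled only from the paused state (an assumption).
  if (current === "PAUSED" && event === "RESUME_REQUESTED") return "RUNNING";
  // Any other combination leaves the status unchanged.
  return current;
}

// Data should only be loaded and the store switched once writes have fully stopped.
function isSafeToRestore(status: DatasetStatus): boolean {
  return status === "PAUSED";
}
```

A script that drives the manual steps could poll the dataset status and proceed only when `isSafeToRestore` returns true.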
Behaviour while pausing/paused
REST and GraphQL API
Any clients attempting to update a dataset through the REST or GraphQL APIs will receive a 503 Service Unavailable error response.
Data Transactions
All in-progress data transactions will run to completion to ensure data is consistent, but no new transactions are processed. New transactions will remain queued and will be processed once the environment has been restarted.
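Clients that write through the REST or GraphQL APIs may want to treat a 503 during a restore as a temporary condition and retry later. The sketch below shows one possible retry decision using capped exponential backoff. It is a client-side suggestion rather than documented Depot behaviour: `retryDelayMs` and its policy fields are hypothetical, and a 503 can have causes other than a paused dataset.

```typescript
// Hypothetical client-side helper, not part of Depot.
interface RetryPolicy {
  maxAttempts: number; // total attempts allowed, including the first
  baseDelayMs: number; // delay after the first failed attempt
  maxDelayMs: number; // upper bound for any single delay
}

// Returns how long to wait before the next attempt, or null if the caller should give up.
// `attempt` is the 1-based number of the attempt that just failed.
function retryDelayMs(httpStatus: number, attempt: number, policy: RetryPolicy): number | null {
  // Only 503 Service Unavailable is treated as "dataset may be paused, try again later".
  if (httpStatus !== 503) return null;
  if (attempt >= policy.maxAttempts) return null;
  // Double the delay after each failed attempt, up to the cap.
  const delay = policy.baseDelayMs * Math.pow(2, attempt - 1);
  return Math.min(delay, policy.maxDelayMs);
}

// Example: wait 1s, 2s, 4s, ... capped at 30s, for up to 20 attempts.
const exampleRetryPolicy: RetryPolicy = { maxAttempts: 20, baseDelayMs: 1000, maxDelayMs: 30000 };
const delayAfterThirdFailure = retryDelayMs(503, 3, exampleRetryPolicy);
```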
Load data into the new store from the backup location
Directly calling the process that restores data from a backupLocation is no longer recommended, since its interface is not stable. This section has been kept because it contains useful information about the process, and a description of the internal step may be reinstated here at some point.
Switch to the new datastore
The environment can now be switched to use the newly migrated store using the CLI or PatchEnvironment API method:
depot patch-environment --data='{"store": {"id": "store02"}}'
id: env123
schema: Environment
store:
id: store02
All requests to the environment will now use the new store.
Remember to restore all object datasets before switching the store, because all of them will use the new store. If a dataset does not have data in the new store, it will appear empty.
Resume data updates to the dataset
Re-enable updates to a dataset after switching the store by running the following CLI command:
patch-dataset --id=abc123 --data='{"status": "RUNNING"}'
id: abc123
schema: Dataset
status: RUNNING
You can now make read and write requests to the new store through the API.
At this point, data is being written to the new store's partition in all storage locations, and the restore is complete.
Disaster recovery
The old mechanism of constantly streaming data to S3 is, in theory, better in a true disaster recovery situation, since any point in time can be recovered without creating a backup. In practice, it is inferior to the point-in-time recovery (PITR) mechanism provided natively by DynamoDB. In particular, it is very difficult to be sure that all data has been reliably replicated to S3, since a bug could be introduced in our handling lambda that renders the backup data invalid.
The proposed structure for a disaster-resilient deployment is as follows:
- DynamoDB global tables are used to place replicas of the important tables into multiple regions (selected by appropriateness as failover sites)
- PITR is enabled on all instances of the global table (able to create a point-in-time backup if the table exists in any region)
- Daily or higher frequency whole-table system backups can be scheduled using https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/BackupRestore.html
- These daily backups could be exported to S3 if desired, although the frequency of these backups could be lower since they are more expensive than system backups
- DynamoDB creates a system backup of any deleted table with PITR enabled (as per https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/PointInTimeRecovery_Howitworks.html)
- The existing stream-to-S3 data may be retained regardless; it is just no longer considered authoritative for backup/restore purposes.
The delivery of this configuration as an option will be in a later release.