Skip to main content

path-formats

Path Formats

When working with file input from S3 in Depot, four properties influence how Depot locates the files to deliver into the target tables: locationUri, pathFormat, inputSuffix and inputPattern.

locationUri

The locationUri property is a string that specifies the S3 location where Depot should look for files. It is matched with StageSchema entries deployed in the environment to determine how, and with which privileges, to load the files.

When more than one StageSchema is available to access the same S3 location, the entry with the longest length of matching URI is used.

note

Should we not transition to using a source of type StageSchema and begin to sunset the locationUri property?

pathFormat

The pathFormat property is an enum specifies the format of the file paths inside the S3 location. It enables Depot to decode the last components of the file path into a dataset and target schema, enabling the delivery of multiple separate target entities from the same action within a transaction.

pathFormatDescription
DATASET_SCHEMADepot looks for files in the ${locationUri}/${datasetId}/${schemaQualifiedName}/ directory
SCHEMADepot looks for files in the ${locationUri}/${schemaQualifiedName}/ directory
DIRECTThe ${locationUri} is interpreted to be a prefix where to look for multiple files.
EXPLICITThe ${locationUri} is interpreted to be the uri of a single file. No file search is performed.
tip

The DATASET_SCHEMA and SCHEMA path formats allow you to omit the target argument for certain operations (INSERT, UPSERT and DELETE on Objects). We discourage continued reliance on this ability as it requires that Depot inspects the source buckets for potential schemas, and you lack control on the insertion order between schemas during the transaction. This can be costly in a bucket with many objects.

It is far better to explicitly specify each target as an individual action within a transaction.

For certain operations (MERGE) and schema types (records), the target argument is mandatory.

inputSuffix

When the inputSuffix property is set, Depot will append the suffix to the pathFormat-determined path and look for files in the subdirectory with that name. For instance:


depot.runTransaction({
locationUri: 's3://my-bucket/my-folder',
pathFormat: 'DATASET_SCHEMA',
inputSuffix: 'my-suffix',
datasetId: 'my-dataset',
target: 'my-schema.MyTable',
inputPattern: 'LOAD.*\.parquet',
fileFormat: 'PARQUET'
})

This will look for files named LOAD*.parquet in the s3://my-bucket/my-folder/my-dataset/my-schema.MyTable/my-suffix directory, and reject any other parquet files found in that directory.

The use of inputSuffix only valid for DATASET_SCHEMA and SCHEMA path formats. It is rejected in EXPLICIT and DIRECT formats.

inputPattern

In all formats but the EXPLICIT, it is possible to manually override which files are loaded by specifying a inputPattern property. This inputPattern is a POSIX regular expression that is matched against the file names in the directory.

It is passed on (adapted as necessary) to the Aurora S3 file loader on Postgres, or to Snowflake's PATTERN clause in the COPY INTO statement.

To match for files starting with LOAD and ending with .parquet one would use the following pattern: LOAD.*\.parquet

tip

On Snowflake's COPY INTO statement, the PATTERN expression should match relative to the uri pointed at by the deployed stage. Depot will automatically prepend the path prefix to the rendered PATTERN expression, taking into account the pathFormat (DATASET_SCHEMA or SCHEMA) and inputSuffix properties: it is not necessary to include the path prefix in the inputPattern.

Because of Depot's automatic path prefixing, the inputPattern should not start with a / or a ^ character.