path-formats
Path Formats
When working with file input from S3 in Depot, four properties influence how Depot locates the files to deliver
into the target tables: locationUri, pathFormat, inputSuffix and inputPattern.
locationUri
The locationUri property is a string that specifies the S3 location where Depot should look for files. It is matched
with StageSchema entries deployed in the environment to determine how, and with which privileges, to load the files.
When more than one StageSchema is available to access the same S3 location, the entry with the longest length
of matching URI is used.
Should we not transition to using a source of type StageSchema and begin to sunset the locationUri property?
pathFormat
The pathFormat property is an enum specifies the format of the file paths inside the S3 location. It enables
Depot to decode the last components of the file path into a dataset and target schema, enabling the delivery of multiple
separate target entities from the same action within a transaction.
pathFormat | Description |
|---|---|
DATASET_SCHEMA | Depot looks for files in the ${locationUri}/${datasetId}/${schemaQualifiedName}/ directory |
SCHEMA | Depot looks for files in the ${locationUri}/${schemaQualifiedName}/ directory |
DIRECT | The ${locationUri} is interpreted to be a prefix where to look for multiple files. |
EXPLICIT | The ${locationUri} is interpreted to be the uri of a single file. No file search is performed. |
The DATASET_SCHEMA and SCHEMA path formats allow you to omit the target argument for certain operations
(INSERT, UPSERT and DELETE on Objects). We discourage continued reliance on this ability as it requires that Depot
inspects the source buckets for potential schemas, and you lack control on the insertion order between schemas
during the transaction. This can be costly in a bucket with many objects.
It is far better to explicitly specify each target as an individual action within a transaction.
For certain operations (MERGE) and schema types (records), the target argument is mandatory.
inputSuffix
When the inputSuffix property is set, Depot will append the suffix to the pathFormat-determined path and look for files
in the subdirectory with that name. For instance:
depot.runTransaction({
locationUri: 's3://my-bucket/my-folder',
pathFormat: 'DATASET_SCHEMA',
inputSuffix: 'my-suffix',
datasetId: 'my-dataset',
target: 'my-schema.MyTable',
inputPattern: 'LOAD.*\.parquet',
fileFormat: 'PARQUET'
})
This will look for files named LOAD*.parquet in the s3://my-bucket/my-folder/my-dataset/my-schema.MyTable/my-suffix directory,
and reject any other parquet files found in that directory.
The use of inputSuffix only valid for DATASET_SCHEMA and SCHEMA path formats. It is rejected in EXPLICIT and DIRECT formats.
inputPattern
In all formats but the EXPLICIT, it is possible to manually override which files are loaded by specifying a inputPattern property.
This inputPattern is a POSIX regular expression that is matched against the file names in the directory.
It is passed on (adapted as necessary) to the Aurora S3 file loader on Postgres, or to Snowflake's PATTERN clause in the COPY INTO statement.
To match for files starting with LOAD and ending with .parquet one would use the following pattern: LOAD.*\.parquet
On Snowflake's COPY INTO statement, the PATTERN expression should match relative to the uri pointed at by the deployed stage.
Depot will automatically prepend the path prefix to the rendered PATTERN expression, taking into account the
pathFormat (DATASET_SCHEMA or SCHEMA) and inputSuffix properties: it is not necessary to include the path prefix in the inputPattern.
Because of Depot's automatic path prefixing, the inputPattern should not start with a / or a ^ character.