Depot Spark Connector

warning

This is a work in progress. The Depot Spark Connector is currently being incubated in one project directory, and will be moved to a more appropriate location in the future DPT-2939 DPT-2940

The Depot Spark Connector enables read-only access to Depot-controlled tables and views from Spark, whether the data are in a Snowflake or a Postgres location under Depot.

Operation

Add a dependency on the Depot Spark Connector to your project
- FIXME: maven coordinates DPT-2940
Ensure your job definition (Spark Submit) provides the following dependencies in the --jars argument:
- postgresql.jar (FIXME: version)
- snowflake-jdbc.jar (FIXME: version)
- spark-snowflake_2.12.jar (FIXME: version)
Ensure your CDK deployment code delivers the following information to EMR job (submit):
- AWS partition, account, region
- Depot environment ID
- Dataset id or alias
- Schema name
- optional: Executor ID to use (on Snowflake) for cost tag allocation

The Depot Spark Connector reacts to the depot format, which takes a url and depot.schema mandatory options:

    val spark = SparkSession.builder
      .appName("Depot from Spark")
      .getOrCreate()

    val items = spark.read
      .format("depot")
      .option("url", "depot:aws:<accountId>:eu-west-1:<depotEnvironmentId>:<datasetIdOrAlias>")
      .option("depot.schema", "DogSource")
      .load()
      .filter(col("testId").startsWith("AUR2SFK-"))

   items.show()

The items DataFrame is now a Spark DataFrame that can be used in Spark SQL or further processing. The Depot Spark Connector resolves whether the dataset is in Snowflake or Postgres and uses the appropriate JDBC driver to connect to the data.

The column names are converted back from the database convention into the naming convention used in the original Depot schema.

tip

Expansion of ref types and of transients is not supported. You should join the data in Spark to reconstruct your data graphs as appropriate.

Postgres notes

On Postgres, the Depot Spark Connector uses the JDBC driver to connect to the database. The Connector interrogates the Depot database to retrieve the RDS instance, and uses it to locate the proper endpoint.

If the endpoint (which is normally within Depot's VPC) is not reachable, the connector will also provision a SSM tunnel via the Aurora Shared Bastion in order to reach the database.

It is possible to select a different read replica (from Aurora Postgres) by setting the replica=some-tag=some-value parameter to the url:

depot:aws:123456789012:region:env:dataset?replica=some-tag=some-value

In which case, the Connector will locate the Aurora replica that has that tag and use that instead of the master instance within the cluster. This may be useful to separate the traffic caused by the Spark job away from the main database, if some latency is acceptable.

warning

The SSM tunneling is considered a convenience but is likely not fast enough for large-scale data processing. It is highly recommended to set up proper connectivity between the VPC where the Spark job runs and the VPC where Depot runs.

Snowflake notes

FIXME: executorId

Notes

Writing from Spark is not supported. To return data back into depot, write into Parquet or JSONL files (if Parquet are not available in your target location) and use a standard transaction to merge the results back in.

We are considering better options in the future, but we prioritise data integrity.

Operation​

Postgres notes​

Snowflake notes​

Notes​

Operation

Postgres notes

Snowflake notes

Notes