Depot Spark Connector
The Depot Spark Connector enables read-only access to Depot-controlled tables and views from Spark, whether the data are in a Snowflake or a Postgres location under Depot.
Operation
- Add a dependency on the Depot Spark Connector to your project
- FIXME: maven coordinates DPT-2940
- Ensure your job definition (Spark Submit) provides the following dependencies in the
--jarsargument:- postgresql.jar (FIXME: version)
- snowflake-jdbc.jar (FIXME: version)
- spark-snowflake_2.12.jar (FIXME: version)
- Ensure your CDK deployment code delivers the following information to EMR job (submit):
- AWS partition, account, region
- Depot environment ID
- Dataset id or alias
- Schema name
- optional: Executor ID to use (on Snowflake) for cost tag allocation
The Depot Spark Connector reacts to the depot format, which takes a url and depot.schema mandatory options:
val spark = SparkSession.builder
.appName("Depot from Spark")
.getOrCreate()
val items = spark.read
.format("depot")
.option("url", "depot:aws:<accountId>:eu-west-1:<depotEnvironmentId>:<datasetIdOrAlias>")
.option("depot.schema", "DogSource")
.load()
.filter(col("testId").startsWith("AUR2SFK-"))
items.show()
The items DataFrame is now a Spark DataFrame that can be used in Spark SQL or further processing. The Depot Spark Connector
resolves whether the dataset is in Snowflake or Postgres and uses the appropriate JDBC driver to connect to the data.
The column names are converted back from the database convention into the naming convention used in the original Depot schema.
Expansion of ref types and of transients is not supported. You should join the data in Spark to reconstruct your data graphs as appropriate.
Postgres notes
On Postgres, the Depot Spark Connector uses the JDBC driver to connect to the database. The Connector interrogates the Depot database to retrieve the RDS instance, and uses it to locate the proper endpoint.
If the endpoint (which is normally within Depot's VPC) is not reachable, the connector will also provision a SSM tunnel via the Aurora Shared Bastion in order to reach the database.
It is possible to select a different read replica (from Aurora Postgres) by setting the replica=some-tag=some-value parameter
to the url:
depot:aws:123456789012:region:env:dataset?replica=some-tag=some-value
In which case, the Connector will locate the Aurora replica that has that tag and use that instead of the master instance within the cluster. This may be useful to separate the traffic caused by the Spark job away from the main database, if some latency is acceptable.
The SSM tunneling is considered a convenience but is likely not fast enough for large-scale data processing. It is highly recommended to set up proper connectivity between the VPC where the Spark job runs and the VPC where Depot runs.
Snowflake notes
FIXME: executorId
Notes
- Writing from Spark is not supported. To return data back into depot, write into Parquet or JSONL files (if Parquet are not available in your target location) and use a standard transaction to merge the results back in.
We are considering better options in the future, but we prioritise data integrity.