# Spark

Apache Spark is a unified analytics engine for large-scale data processing and distributed compute. Vulcan integrates with Spark to manage your data transformations using catalogs like Iceberg, Hive, and Delta Lake.

> **VDE is not supported on Spark.** Setting `vde: true` in `config.yaml` is rejected when the gateway type is `spark`. Spark gateways must run in simple mode (`vde: false`, which is the default).

***

### Engine adapter type

To use Spark, configure a gateway with the following adapter type:

```yaml
type: spark
```

The Spark adapter supports Vulcan's local and built-in scheduling.

### Before you start

Make sure you have:

* A running Spark cluster (standalone, YARN, or Kubernetes)
* Spark 3.x or higher (3.4 or above recommended for catalog support)
* Network connectivity to the Spark master node
* Read and write access to the configured storage (S3, HDFS, or compatible)

***

### Required permissions

Vulcan requires the following Spark permissions:

| Permission                                                   | Required for                 |
| ------------------------------------------------------------ | ---------------------------- |
| Access to create and manage tables in the configured catalog | Creating model output tables |
| Read and write access to the configured storage              | Writing physical table files |
| Permission to submit Spark applications                      | Running model queries        |

***

### Required connection options

Use these fields when setting up a Spark gateway:

| Option | Description                        | Required |
| ------ | ---------------------------------- | :------: |
| `type` | Engine type name. Must be `spark`. |    Yes   |

### Optional connection options

| Option       | Description                                                                                        |
| ------------ | -------------------------------------------------------------------------------------------------- |
| `config_dir` | Value to set for `SPARK_CONFIG_DIR`.                                                               |
| `catalog`    | The catalog to use when issuing commands. If not set, defaults to `spark_catalog` for Spark < 3.4. |
| `config`     | Key/value pairs for Spark configuration (e.g. S3 credentials, executor settings).                  |
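
As a rough sketch of how these optional fields might fit together, the fragment below sets `config_dir` and passes executor settings through `config`. The directory path and the memory and core values are placeholder assumptions, not recommendations. `spark.executor.memory` and `spark.executor.cores` are standard Spark properties.

```yaml
gateways:
  default:
    connection:
      type: spark
      # Hypothetical path: point this at the directory holding your Spark config files.
      config_dir: /opt/spark/conf
      catalog: iceberg_catalog
      config:
        # Placeholder sizing values: tune these for your cluster.
        spark.executor.memory: 4g
        spark.executor.cores: 2
```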

***

### Authentication

Spark authentication is configured through the `config` option or through the Spark configuration directory named by `SPARK_CONFIG_DIR` (set with the `config_dir` option). The method depends on the underlying catalog and storage:

| Method                     | How to configure                                                                         |
| -------------------------- | ---------------------------------------------------------------------------------------- |
| S3 credentials             | Pass `spark.hadoop.fs.s3a.access.key` and `spark.hadoop.fs.s3a.secret.key` via `config`. |
| HDFS Kerberos              | Configure Kerberos settings in `SPARK_CONFIG_DIR`.                                       |
| Iceberg/Delta catalog auth | Depends on the catalog provider and storage backend.                                     |

Always use environment variables for sensitive values:

```yaml
config:
  spark.hadoop.fs.s3a.secret.key: "{{ env_var('S3_SECRET_KEY') }}"
```

***

### Example configuration

Add a Spark gateway to your Vulcan project configuration.

```yaml
gateways:
  default:
    connection:
      type: spark
      catalog: iceberg_catalog
      config:
        spark.hadoop.fs.s3a.endpoint: s3.amazonaws.com
        spark.hadoop.fs.s3a.access.key: "{{ env_var('S3_ACCESS_KEY') }}"
        spark.hadoop.fs.s3a.secret.key: "{{ env_var('S3_SECRET_KEY') }}"

defaultGateway: default

modelDefaults:
  dialect: spark
```

> Set `catalog` to the name of your Iceberg, Hive, or Delta catalog. Vulcan uses this as the default catalog for all model names that do not include an explicit catalog prefix.

> Pass all sensitive values, such as S3 keys and database passwords, through environment variables. Never write them directly in `config.yaml`.

***

> Spark cannot be used for the `stateConnection`. Use a transactional database such as PostgreSQL for state storage.
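
A minimal sketch of pairing a Spark gateway with a PostgreSQL state store might look like the following. The placement of `stateConnection` under the gateway, the PostgreSQL field names (`host`, `port`, `user`, `password`, `database`), and the environment variable names are assumptions for illustration. Check the PostgreSQL connection reference for the exact options.

```yaml
gateways:
  default:
    connection:
      type: spark
      catalog: iceberg_catalog
    # Assumed layout: state is kept in a transactional database, not in Spark.
    stateConnection:
      type: postgres
      host: "{{ env_var('STATE_DB_HOST') }}"
      port: 5432
      user: "{{ env_var('STATE_DB_USER') }}"
      password: "{{ env_var('STATE_DB_PASSWORD') }}"
      database: vulcan_state  # hypothetical database name
```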

### Materialization behavior

On Spark, Vulcan uses the following materialization strategy for each model kind:

| Model kind                  | Strategy                                                                              |
| --------------------------- | ------------------------------------------------------------------------------------- |
| `INCREMENTAL_BY_TIME_RANGE` | INSERT OVERWRITE by time column partition                                             |
| `INCREMENTAL_BY_UNIQUE_KEY` | Not supported. Use `INCREMENTAL_BY_TIME_RANGE` or `INCREMENTAL_BY_PARTITION` instead. |
| `INCREMENTAL_BY_PARTITION`  | INSERT OVERWRITE by partitioning key                                                  |
| `FULL`                      | INSERT OVERWRITE                                                                      |

***

### Next steps

After configuring Spark, continue with:

```
Connect to Engine -> Define models -> Validate and test locally
```

