
# Configurations

## Structure of Lakehouse manifest

```yaml
name: ${resource-name}
version: v1alpha
type: lakehouse
description: ${description}
tags:
  - ${tag1}
  - ${tag2}
spec:
  compute: ${compute-resource-name}
  runAsUser: ${user-id}
  logLevel: ${log-level}
  iceberg:
    metastore:
      type: iceberg-jdbc-catalog
      replicas: ${number-of-replicas}
      secret: ${tenant}:${metastore-secret-name}
      postgresql:
        host: ${postgres-host}
        port: ${port}
        database: ${database-name}
        schema: ${schema-name}
      hadoopConf:
        ${hadoop-configuration-key}: ${value}
      resources:
        requests:
          cpu: ${cpu-request}
          memory: ${memory-request}
        limits:
          cpu: ${cpu-limit}
          memory: ${memory-limit}
    storage:
      type: ${storage-type}
      # For S3 storage
      s3:
        bucket: ${s3-bucket}
        relativePath: ${relative-path}
        region: ${region}
        scheme: ${scheme}
        format: ${format}
      # For ABFSS storage
      abfss:
        account: ${abfss-account}
        container: ${container}
        relativePath: ${relative-path}
        endpointSuffix: ${endpoint-suffix}
        format: ${format}
      # For GCS storage
      gcs:
        bucket: ${gcs-bucket}
        relativePath: ${relative-path}
        format: ${format}
      secret: ${tenant}:${storage-secret-name}
    sherpa:
      replicas: ${number-of-replicas}
      resources:
        requests:
          cpu: ${cpu-request}
          memory: ${memory-request}
        limits:
          cpu: ${cpu-limit}
          memory: ${memory-limit}
    sparkCluster:
      server:
        requests:
          cpu: ${cpu-request}
          memory: ${memory-request}
        limits:
          cpu: ${cpu-limit}
          memory: ${memory-limit}
      driver:
        coreLimit: ${core-limit}
        cores: ${number-of-cores}
        coreRequest: ${cpu-request}
        memory: ${memory}
      executor:
        coreLimit: ${core-limit}
        cores: ${number-of-cores}
        coreRequest: ${cpu-request}
        memory: ${memory}
        instances: ${number-of-instances}
```
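
As an illustration, the template above could be filled in for an S3-backed Lakehouse roughly as sketched below. The values are the sample values used in the attribute examples on this page; they are placeholders, not recommendations. The sketch assumes that only the storage block matching `storage.type` needs to be present and that optional attributes can be left out.

```yaml
# Illustrative S3-backed Lakehouse manifest; all values are sample placeholders.
name: s3pglh
version: v1alpha
type: lakehouse
description: S3-backed Lakehouse for analytics workloads
tags:
  - lakehouse
  - s3
spec:
  compute: ironstorm-compute
  runAsUser: iamgroot
  logLevel: INFO
  iceberg:
    metastore:
      type: iceberg-jdbc-catalog
      replicas: 1
      secret: engineering:lhpgsecret
      postgresql:
        host: modern-postgresql-server.postgres.database.azure.com
        port: "5432"
        database: db_englh_s3   # dedicated to this Lakehouse
        schema: public
    storage:
      type: s3
      s3:                       # only the block matching `type` is included
        bucket: lakehouse-production
        relativePath: analytics
        region: ap-south-1
        format: ICEBERG
      secret: engineering:s3-secrets
    sherpa:
      replicas: 1
    sparkCluster:
      driver:
        cores: 1
        memory: "2048m"
      executor:
        cores: 1
        memory: "2048m"
        instances: 1
```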

## Attribute details

### Resource meta section

The resource meta section holds metadata attributes that apply to all Resource types.

#### `name`

**Description:** The name of the Lakehouse Resource. Must be unique within the instance.

| Data Type | Requirement | Default Value | Possible Value                                                                                                                                                         |
| --------- | ----------- | ------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| string    | mandatory   | none          | A valid string that matches the regex pattern `[a-z]([a-z0-9]*)`. Special characters, except for hyphens/dashes, are not allowed. The maximum length is 48 characters. |

**Example Usage:**

```yaml
name: s3pglh
```

#### `version`

**Description:** The manifest version for the Lakehouse Resource.

| Data Type | Requirement | Default Value | Possible Value |
| --------- | ----------- | ------------- | -------------- |
| string    | mandatory   | none          | v1alpha        |

**Example Usage:**

```yaml
version: v1alpha
```

#### `type`

**Description:** The type of DataOS Resource.

| Data Type | Requirement | Default Value | Possible Value |
| --------- | ----------- | ------------- | -------------- |
| string    | mandatory   | none          | lakehouse      |

**Example Usage:**

```yaml
type: lakehouse
```

#### `description`

**Description:** A brief description of the Lakehouse Resource.

| Data Type | Requirement | Default Value | Possible Value |
| --------- | ----------- | ------------- | -------------- |
| string    | optional    | none          | any string     |

**Example Usage:**

```yaml
description: S3-backed Lakehouse for analytics workloads
```

#### `tags`

**Description:** Tags for categorizing and filtering the Lakehouse Resource.

| Data Type       | Requirement | Default Value | Possible Value        |
| --------------- | ----------- | ------------- | --------------------- |
| list of strings | optional    | none          | list of valid strings |

**Example Usage:**

```yaml
tags:
  - lakehouse
  - s3
  - production
```

### Lakehouse-specific section (`spec`)

#### `spec`

**Description:** A YAML mapping that holds all Lakehouse-specific configuration, including the Compute Resource, the run-as user, the log level, and the Iceberg configuration.

| Data Type | Requirement | Default Value | Possible Value |
| --------- | ----------- | ------------- | -------------- |
| mapping   | mandatory   | none          | none           |

**Example Usage:**

```yaml
spec:
  compute: ironstorm-compute
  runAsUser: iamgroot
  logLevel: INFO
  iceberg:
    # ...
```

#### `spec.compute`

**Description:** Defines the Compute Resource to be used by the Lakehouse for its runtime components.

| Data Type | Requirement | Default Value | Possible Value              |
| --------- | ----------- | ------------- | --------------------------- |
| string    | mandatory   | none          | valid Compute Resource name |

**Example Usage:**

```yaml
spec:
  compute: ironstorm-compute
```

#### `spec.runAsUser`

**Description:** The user ID under which Lakehouse operations run. The Lakehouse performs operations on behalf of this user, with that user's authority.

| Data Type | Requirement | Default Value | Possible Value       |
| --------- | ----------- | ------------- | -------------------- |
| string    | mandatory   | none          | valid DataOS user ID |

**Example Usage:**

```yaml
spec:
  runAsUser: iamgroot
```

#### `spec.logLevel`

**Description:** Sets the logging verbosity for Lakehouse components.

| Data Type | Requirement | Default Value | Possible Value           |
| --------- | ----------- | ------------- | ------------------------ |
| string    | optional    | INFO          | DEBUG, INFO, WARN, ERROR |

**Example Usage:**

```yaml
spec:
  logLevel: DEBUG
```

### Iceberg section (`spec.iceberg`)

#### `iceberg`

**Description:** Contains configurations for the Lakehouse built on the Iceberg table format, including metastore, storage, Sherpa, and Spark Cluster settings.

| Data Type | Requirement | Default Value | Possible Value                              |
| --------- | ----------- | ------------- | ------------------------------------------- |
| mapping   | mandatory   | none          | valid Iceberg Lakehouse-specific attributes |

**Example Usage:**

```yaml
spec:
  iceberg:
    metastore:
      # metastore configuration
    storage:
      # storage configuration
    sherpa:
      # sherpa configuration
    sparkCluster:
      # Spark Cluster configuration
```

### Metastore section (`spec.iceberg.metastore`)

#### `metastore`

**Description:** Configuration for the REST Catalog metastore used by Iceberg tables. The metastore is backed by a PostgreSQL database via JDBC.

| Data Type | Requirement | Default Value | Possible Value                   |
| --------- | ----------- | ------------- | -------------------------------- |
| mapping   | mandatory   | none          | metastore configuration settings |

**Example Usage:**

```yaml
metastore:
  type: iceberg-jdbc-catalog
  replicas: 1
  secret: engineering:lhpgsecret
  postgresql:
    host: myhost.postgres.database.azure.com
    port: "5432"
    database: mydb
    schema: public
```

#### `metastore.type`

**Description:** Specifies the type of metastore catalog. The REST Catalog uses a JDBC-backed catalog with PostgreSQL.

| Data Type | Requirement | Default Value | Possible Value       |
| --------- | ----------- | ------------- | -------------------- |
| string    | mandatory   | none          | iceberg-jdbc-catalog |

**Example Usage:**

```yaml
metastore:
  type: iceberg-jdbc-catalog
```

#### `metastore.replicas`

**Description:** The number of replicas for the metastore service.

| Data Type | Requirement | Default Value | Possible Value             |
| --------- | ----------- | ------------- | -------------------------- |
| integer   | optional    | 1             | any valid positive integer |

**Example Usage:**

```yaml
metastore:
  replicas: 2
```

#### `metastore.secret`

**Description:** Reference to the secret containing PostgreSQL credentials for the metastore. The format is `${tenant}:${secret-name}`.

| Data Type | Requirement | Default Value | Possible Value                                        |
| --------- | ----------- | ------------- | ----------------------------------------------------- |
| string    | mandatory   | none          | valid secret reference in `tenant:secret-name` format |

**Example Usage:**

```yaml
metastore:
  secret: engineering:lhpgsecret
```

#### `metastore.postgresql`

**Description:** Connection details for the PostgreSQL database used as the metastore persistence layer.

| Data Type | Requirement | Default Value | Possible Value                            |
| --------- | ----------- | ------------- | ----------------------------------------- |
| mapping   | mandatory   | none          | valid PostgreSQL connection configuration |

{% hint style="warning" %}
Each Lakehouse must use a **dedicated PostgreSQL database**. Sharing the same database across multiple Lakehouses causes the metastore to intermix metadata file paths from different storage backends and credentials. When a scan runs against one Lakehouse, the REST Catalog returns file paths that belong to other Lakehouses, which the current storage secret cannot access, causing metadata operations to fail. You can host multiple databases on the same PostgreSQL server, but each Lakehouse must reference its own isolated database.
{% endhint %}
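
The isolation requirement above can be sketched as two manifest fragments that point at the same PostgreSQL server but at different databases. The database name `db_englh_abfss` is a hypothetical name chosen for illustration; the host and the other values are the sample values used elsewhere on this page.

```yaml
# Lakehouse A (fragment of its manifest)
metastore:
  postgresql:
    host: modern-postgresql-server.postgres.database.azure.com
    port: "5432"
    database: db_englh_s3      # used only by Lakehouse A
    schema: public
---
# Lakehouse B (fragment of a separate manifest): same server, different database
metastore:
  postgresql:
    host: modern-postgresql-server.postgres.database.azure.com
    port: "5432"
    database: db_englh_abfss   # hypothetical name; used only by Lakehouse B
    schema: public
```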

**Sub-attributes:**

| Attribute  | Data Type | Requirement | Description                                  |
| ---------- | --------- | ----------- | -------------------------------------------- |
| `host`     | string    | mandatory   | The hostname or URL of the PostgreSQL server |
| `port`     | string    | mandatory   | The port number (typically `"5432"`)         |
| `database` | string    | mandatory   | The database name                            |
| `schema`   | string    | mandatory   | The schema name (typically `public`)         |

**Example Usage:**

```yaml
metastore:
  postgresql:
    host: modern-postgresql-server.postgres.database.azure.com
    port: "5432"
    database: db_englh_s3
    schema: public
```

#### `metastore.hadoopConf`

**Description:** Additional Hadoop configuration properties for the metastore (e.g., S3 connection tuning).

| Data Type | Requirement | Default Value | Possible Value                                     |
| --------- | ----------- | ------------- | -------------------------------------------------- |
| mapping   | optional    | none          | key-value pairs of Hadoop configuration properties |

**Example Usage:**

```yaml
metastore:
  hadoopConf:
    fs.s3a.connection.maximum: 1000
```

#### `metastore.resources`

**Description:** CPU and memory resource allocations for the metastore service.

| Data Type | Requirement | Default Value | Possible Value                         |
| --------- | ----------- | ------------- | -------------------------------------- |
| mapping   | optional    | none          | CPU and memory resource configurations |

**Example Usage:**

```yaml
metastore:
  resources:
    requests:
      cpu: "1"
      memory: 1000Mi
    limits:
      cpu: "1"
      memory: 2000Mi
```

**Sub-attributes:**

| Attribute         | Data Type | Requirement | Description               |
| ----------------- | --------- | ----------- | ------------------------- |
| `requests.cpu`    | string    | optional    | Minimum CPU allocation    |
| `requests.memory` | string    | optional    | Minimum memory allocation |
| `limits.cpu`      | string    | optional    | Maximum CPU allocation    |
| `limits.memory`   | string    | optional    | Maximum memory allocation |

### Storage section (`spec.iceberg.storage`)

#### `storage`

**Description:** Defines the connection to the underlying cloud object storage for the Lakehouse.

| Data Type | Requirement | Default Value | Possible Value              |
| --------- | ----------- | ------------- | --------------------------- |
| mapping   | mandatory   | none          | valid storage configuration |

#### `storage.type`

**Description:** The type of cloud storage backend.

| Data Type | Requirement | Default Value | Possible Value |
| --------- | ----------- | ------------- | -------------- |
| string    | mandatory   | none          | s3, abfss, gcs |

**Example Usage:**

```yaml
storage:
  type: s3
```

#### `storage.secret`

**Description:** Reference to the secret containing cloud storage credentials. The format is `${tenant}:${secret-name}`.

| Data Type | Requirement | Default Value | Possible Value                                        |
| --------- | ----------- | ------------- | ----------------------------------------------------- |
| string    | mandatory   | none          | valid secret reference in `tenant:secret-name` format |

**Example Usage:**

```yaml
storage:
  secret: engineering:s3-secrets
```

#### `storage.s3`

**Description:** Configuration for S3-type storage.

| Data Type | Requirement                 | Default Value | Possible Value         |
| --------- | --------------------------- | ------------- | ---------------------- |
| mapping   | mandatory (when type is s3) | none          | valid S3 configuration |

**Sub-attributes:**

| Attribute      | Data Type | Requirement | Description                                         |
| -------------- | --------- | ----------- | --------------------------------------------------- |
| `bucket`       | string    | mandatory   | The name of the S3 bucket                           |
| `relativePath` | string    | optional    | Folder path within the bucket for data organization |
| `region`       | string    | optional    | The AWS region (e.g., `ap-south-1`)                 |
| `scheme`       | string    | optional    | The access scheme (e.g., `s3a://`)                  |
| `format`       | string    | optional    | Data format, defaults to `ICEBERG`                  |

**Example Usage:**

```yaml
storage:
  type: s3
  s3:
    bucket: lakehouse-production
    relativePath: analytics
    region: ap-south-1
    format: ICEBERG
  secret: engineering:s3-secrets
```

#### `storage.abfss`

**Description:** Configuration for ABFSS (Azure Blob File System Secure) storage.

| Data Type | Requirement                    | Default Value | Possible Value            |
| --------- | ------------------------------ | ------------- | ------------------------- |
| mapping   | mandatory (when type is abfss) | none          | valid ABFSS configuration |

**Sub-attributes:**

| Attribute        | Data Type | Requirement | Description                                        |
| ---------------- | --------- | ----------- | -------------------------------------------------- |
| `account`        | string    | mandatory   | The Azure storage account name                     |
| `container`      | string    | mandatory   | The container name                                 |
| `relativePath`   | string    | optional    | Folder path within the container                   |
| `endpointSuffix` | string    | mandatory   | The endpoint suffix (e.g., `dfs.core.windows.net`) |
| `format`         | string    | optional    | Data format (e.g., `ICEBERG`)                      |

**Example Usage:**

```yaml
storage:
  type: abfss
  abfss:
    account: mockdataos
    container: dropzone001
    relativePath: lh01
    endpointSuffix: dfs.core.windows.net
  secret: engineering:abfss-secrets
```

#### `storage.gcs`

**Description:** Configuration for Google Cloud Storage.

| Data Type | Requirement                  | Default Value | Possible Value          |
| --------- | ---------------------------- | ------------- | ----------------------- |
| mapping   | mandatory (when type is gcs) | none          | valid GCS configuration |

**Sub-attributes:**

| Attribute      | Data Type | Requirement | Description                   |
| -------------- | --------- | ----------- | ----------------------------- |
| `bucket`       | string    | mandatory   | The GCS bucket name           |
| `relativePath` | string    | optional    | Folder path within the bucket |
| `format`       | string    | optional    | Data format (e.g., `ICEBERG`) |

**Example Usage:**

```yaml
storage:
  type: gcs
  gcs:
    bucket: gcs-lakehouse-prod
    relativePath: analytics
  secret: engineering:gcs-secrets
```

### Sherpa section (`spec.iceberg.sherpa`)

#### `sherpa`

**Description:** Configuration for the Sherpa orchestration sidecar that handles operation queuing and task delegation.

| Data Type | Requirement | Default Value | Possible Value                |
| --------- | ----------- | ------------- | ----------------------------- |
| mapping   | optional    | none          | Sherpa configuration settings |

**Example Usage:**

```yaml
sherpa:
  replicas: 1
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      cpu: 500m
      memory: 1000Mi
```

#### `sherpa.replicas`

**Description:** The number of replicas for the Sherpa service.

| Data Type | Requirement | Default Value | Possible Value             |
| --------- | ----------- | ------------- | -------------------------- |
| integer   | optional    | 1             | any valid positive integer |
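
**Example Usage:** a minimal sketch; the value `2` is an arbitrary illustration, not a sizing recommendation.

```yaml
sherpa:
  replicas: 2   # illustrative value; defaults to 1 when omitted
```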

#### `sherpa.resources`

**Description:** CPU and memory resource allocations for the Sherpa service.

| Data Type | Requirement | Default Value | Possible Value                         |
| --------- | ----------- | ------------- | -------------------------------------- |
| mapping   | optional    | none          | CPU and memory resource configurations |

**Sub-attributes:**

| Attribute         | Data Type | Requirement | Description               |
| ----------------- | --------- | ----------- | ------------------------- |
| `requests.cpu`    | string    | optional    | Minimum CPU allocation    |
| `requests.memory` | string    | optional    | Minimum memory allocation |
| `limits.cpu`      | string    | optional    | Maximum CPU allocation    |
| `limits.memory`   | string    | optional    | Maximum memory allocation |
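
**Example Usage:** a minimal sketch that reuses the sample values from the `sherpa` example above; treat them as placeholders rather than recommended sizes.

```yaml
sherpa:
  resources:
    requests:
      cpu: 200m        # minimum CPU allocation
      memory: 512Mi    # minimum memory allocation
    limits:
      cpu: 500m        # maximum CPU allocation
      memory: 1000Mi   # maximum memory allocation
```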

### Spark Cluster section (`spec.iceberg.sparkCluster`)

#### `sparkCluster`

**Description:** Configures the embedded Spark Cluster provisioned by the Lakehouse. The cluster handles query execution and maintenance operations. It consists of a server, a driver, and one or more executors.

| Data Type | Requirement | Default Value | Possible Value                       |
| --------- | ----------- | ------------- | ------------------------------------ |
| mapping   | optional    | none          | Spark Cluster configuration settings |

**Example Usage:**

```yaml
sparkCluster:
  server:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      cpu: 500m
      memory: 1000Mi
  driver:
    coreLimit: "2048m"
    cores: 1
    coreRequest: "500m"
    memory: "2048m"
  executor:
    coreLimit: "2048m"
    cores: 1
    coreRequest: "500m"
    memory: "2048m"
    instances: 1
```

#### `sparkCluster.server`

**Description:** Resource allocation for the Spark server component that manages the cluster lifecycle.

| Data Type | Requirement | Default Value | Possible Value                         |
| --------- | ----------- | ------------- | -------------------------------------- |
| mapping   | optional    | none          | CPU and memory resource configurations |

**Sub-attributes:**

| Attribute         | Data Type | Requirement | Description               |
| ----------------- | --------- | ----------- | ------------------------- |
| `requests.cpu`    | string    | optional    | Minimum CPU allocation    |
| `requests.memory` | string    | optional    | Minimum memory allocation |
| `limits.cpu`      | string    | optional    | Maximum CPU allocation    |
| `limits.memory`   | string    | optional    | Maximum memory allocation |
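
**Example Usage:** a minimal sketch that reuses the sample values from the `sparkCluster` example above; treat them as placeholders rather than recommended sizes.

```yaml
sparkCluster:
  server:
    requests:
      cpu: 200m        # minimum CPU allocation
      memory: 512Mi    # minimum memory allocation
    limits:
      cpu: 500m        # maximum CPU allocation
      memory: 1000Mi   # maximum memory allocation
```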

#### `sparkCluster.driver`

**Description:** Configuration for the Spark driver that coordinates job execution.

| Data Type | Requirement | Default Value | Possible Value             |
| --------- | ----------- | ------------- | -------------------------- |
| mapping   | optional    | none          | Spark driver configuration |

**Sub-attributes:**

| Attribute     | Data Type | Requirement | Description                                                                                                  |
| ------------- | --------- | ----------- | ------------------------------------------------------------------------------------------------------------ |
| `coreLimit`   | string    | optional    | CPU core limit for the driver (e.g., `"2048m"`)                                                              |
| `cores`       | integer   | optional    | Number of CPU cores for the driver. Accepts whole numbers only.                                              |
| `coreRequest` | string    | optional    | CPU request for the driver in Kubernetes format (e.g., `"500m"`). Accepts fractional values, unlike `cores`. |
| `memory`      | string    | optional    | Memory for the driver (e.g., `"2048m"`)                                                                      |

**Example Usage:**

```yaml
driver:
  coreLimit: "2048m"
  cores: 1
  coreRequest: "500m"
  memory: "2048m"
```

#### `sparkCluster.executor`

**Description:** Configuration for the Spark executor instances that run the actual computation.

| Data Type | Requirement | Default Value | Possible Value               |
| --------- | ----------- | ------------- | ---------------------------- |
| mapping   | optional    | none          | Spark executor configuration |

**Sub-attributes:**

| Attribute     | Data Type | Requirement | Description                                                                                                |
| ------------- | --------- | ----------- | ---------------------------------------------------------------------------------------------------------- |
| `coreLimit`   | string    | optional    | CPU core limit per executor (e.g., `"2048m"`)                                                              |
| `cores`       | integer   | optional    | Number of CPU cores per executor. Accepts whole numbers only.                                              |
| `coreRequest` | string    | optional    | CPU request per executor in Kubernetes format (e.g., `"500m"`). Accepts fractional values, unlike `cores`. |
| `memory`      | string    | optional    | Memory per executor (e.g., `"2048m"`)                                                                      |
| `instances`   | integer   | optional    | Number of executor instances to launch                                                                     |

**Example Usage:**

```yaml
executor:
  coreLimit: "2048m"
  cores: 1
  coreRequest: "500m"
  memory: "2048m"
  instances: 1
```

