
# Lakehouse

Lakehouse is a DataOS Resource that combines Apache Iceberg with cloud object storage. It gives you warehouse-style management on open storage formats.

When you create a Lakehouse, DataOS provisions the storage integration, a REST Catalog backed by PostgreSQL, and query and maintenance services.

### Key features of Lakehouse

* **Decoupled storage and compute:** Scale storage and compute independently.
* **ACID support:** Keep writes and reads consistent during concurrent operations.
* **Open table and file formats:** Use Apache Iceberg for metadata and Apache Parquet for data files.
* **Unified workload support:** Run analytics, transformation, and maintenance from one storage layer.
* **Managed orchestration:** Bundle the REST Catalog, Spark Cluster, and Sherpa into one Resource.

### Architecture of a Lakehouse

#### Storage

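Lakehouse supports Amazon S3, Azure Blob File System Secure (ABFSS), and Google Cloud Storage (GCS) as storage backends. Data files are stored as Apache Parquet, and table metadata is managed with Apache Iceberg. Secrets hold the credentials required to reach the storage account.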

#### REST Catalog

The REST Catalog is the Lakehouse metastore. It stores Iceberg table metadata, including schema, snapshots, views, and file locations.

It uses a JDBC-backed catalog with PostgreSQL as the persistence layer. This enables reliable metadata storage and Iceberg view support.

When connecting to the REST Catalog through a Depot, the catalog address is expressed as two fields:

```yaml
metastoreUrl: <kong-proxy-service-name>.<environment-namespace>.svc.cluster.local:80
metastoreRelativePath: /<ingress-path>
```

#### Spark Cluster

Lakehouse provisions an embedded Spark Cluster (`sparkCluster`) that serves as the query and maintenance engine. It handles:

* Query execution and direct access through port-forwarding
* Maintenance operations such as compaction, manifest rewrite, and snapshot expiry

The Spark Cluster has three configurable components:

* **Server:** manages the cluster lifecycle
* **Driver:** coordinates the Spark job
* **Executor:** runs the actual computation

#### Sherpa Server

Sherpa runs alongside the REST Catalog. It handles:

1. SQL translation for the Spark Cluster
2. Maintenance job orchestration

The execution flow is:

```
User → REST Catalog → Sherpa Server → Spark Cluster
```

Sherpa queues requests and runs them one at a time. This prevents race conditions. It also retries failed operations and stores job status for later inspection.
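
The queueing behavior described above can be illustrated with a small shell sketch. This is not Sherpa's implementation: the function name `run_serially`, the retry count, and the status file format are all hypothetical. It only shows the general pattern of taking one request at a time, retrying on failure, and recording the outcome for later inspection.

```bash
# Illustrative only: process queued commands one at a time, retry failures,
# and record each outcome. Names and formats here are hypothetical.
run_serially() {
  status_file=$1   # file that receives one "RESULT<TAB>command" line per job
  max_attempts=$2  # attempts per command before it is marked FAILED
  shift 2          # remaining arguments are the queued commands, in order
  for cmd in "$@"; do
    attempt=1
    result=FAILED
    while [ "$attempt" -le "$max_attempts" ]; do
      if sh -c "$cmd"; then
        result=SUCCEEDED
        break
      fi
      attempt=$((attempt + 1))
    done
    printf '%s\t%s\n' "$result" "$cmd" >> "$status_file"
  done
}
```

Because each command finishes (or exhausts its retries) before the next one starts, no two operations run against the same table metadata at once, which is the property the serial queue is meant to provide.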

#### Sherpa Worker

Sherpa uses a master-worker model. The server delegates queued work to the worker, and the Spark Cluster executes the job.

### Create and manage a Lakehouse

#### Prerequisites

Before you create a Lakehouse, make sure you have:

* A Lakehouse domain in your instance
* A Flare stack provisioned and active
* A compute resource for the runtime components
* Access to a supported object storage account
* PostgreSQL connection details for the metastore
* A tenant-specific role (**Tenant Admin**, **Data Admin**, or **Data Developer**) to create a Lakehouse
* Resource-specific permission, granted by the Lakehouse owner, to use an existing Lakehouse

Verify that the Lakehouse domain exists:

```bash
dataos-ctl domain get -t resource -a
```

Verify that the Flare stack is present and active:

```bash
dataos-ctl develop stack versions
```

Supported storage backends:

* Amazon S3
* Azure Blob File System Secure (ABFSS)
* Google Cloud Storage (GCS)

Required PostgreSQL details:

* Host
* Port, usually `5432`
* Database
* Schema, usually `public`
* Username and password, stored in a Secret

{% hint style="warning" %}
Each Lakehouse must point to a **dedicated, isolated PostgreSQL database**. Do not reuse the same PostgreSQL database across multiple Lakehouses.

The PostgreSQL metastore stores the metadata file locations for every Iceberg table registered under a Lakehouse, including tables created by different users with different storage credentials. If two Lakehouses share the same database, the metastore intermixes metadata paths from both storage backends. When a scan runs against one Lakehouse, the REST Catalog returns file paths that belong to the other Lakehouse. The storage secret for the current Lakehouse does not have permission to access those paths, causing metadata operations to fail.

You can host multiple databases on the same PostgreSQL server. Each Lakehouse must reference its own dedicated database.
{% endhint %}
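
One way to catch an accidentally shared database before applying anything is to compare the `postgresql` blocks of your Lakehouse manifests. The sketch below is a hypothetical helper, not part of the DataOS CLI. It assumes each manifest contains a single `postgresql:` block laid out like the samples on this page, and it compares values as plain text, so two different spellings of the same host would not be detected.

```bash
# Illustrative check: print every PostgreSQL host/database pair that appears
# in more than one Lakehouse manifest. Assumes one `postgresql:` block per
# file, with `host:` and `database:` keys as in the sample manifest.
find_shared_databases() {
  for manifest in "$@"; do
    awk '
      /^[ \t]*postgresql:/          { in_pg = 1; next }
      in_pg && /^[ \t]*host:/       { host = $2 }
      in_pg && /^[ \t]*database:/   { db = $2 }
      END { if (host != "" && db != "") print host "/" db }
    ' "$manifest"
  done | sort | uniq -d   # uniq -d keeps only pairs seen more than once
}
```

Calling `find_shared_databases lakehouse-a.yaml lakehouse-b.yaml` (file names are placeholders) prints nothing when every manifest points to its own database.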

{% stepper %}
{% step %}

#### Create the required Secrets

Create these Secrets before you apply the Lakehouse manifest:

1. A metastore Secret for PostgreSQL credentials
2. A storage Secret for object storage credentials

**Metastore Secret**

```yaml
name: ${metastore-secret-name}
version: v2alpha
type: secret
description: PostgreSQL metastore secret for Lakehouse.
secret:
  type: key-value
  data:
    username: ${postgres-username}
    password: ${postgres-password}
```

**Storage Secret**

{% tabs %}
{% tab title="S3" %}

```yaml
name: ${storage-secret-name}
version: v2alpha
type: secret
description: S3 storage secret for Lakehouse.
secret:
  type: key-value
  data:
    aws_access_key: ${aws-access-key}
    aws_secret_key: ${aws-secret-key}
    storage_type: s3
```

{% endtab %}

{% tab title="ABFSS" %}

```yaml
name: ${storage-secret-name}
version: v2alpha
type: secret
description: ABFSS storage secret for Lakehouse.
secret:
  type: key-value
  data:
    az_account_name: ${azure-storage-account-name}
    az_account_key: ${azure-storage-account-key}
    storage_type: abfss
```

{% endtab %}

{% tab title="GCS" %}

```yaml
name: ${storage-secret-name}
version: v2alpha
type: secret
description: GCS storage secret for Lakehouse.
secret:
  type: key-value
  data:
    gcp_json_key: ${gcp-service-account-key}
    storage_type: gcs
```

You can also provide the GCS key as a file reference instead of an inline value:

```yaml
name: ${storage-secret-name}
version: v2alpha
type: secret
description: GCS storage secret for Lakehouse.
secret:
  type: key-value
  files:
    gcp_json_key: ${path-to-gcp-credentials-file}
  data:
    storage_type: gcs
```

{% endtab %}
{% endtabs %}

Apply the Secret manifest:

```bash
dataos-ctl resource apply -f ${secret-manifest-file-path}
```

Verify the Secret:

```bash
dataos-ctl resource get -t secret
```

{% endstep %}

{% step %}

#### Draft the Lakehouse manifest

The Lakehouse manifest has these sections:

* Resource metadata
* `spec`
  * `iceberg.metastore`
  * `iceberg.storage`
  * `iceberg.sherpa` (optional)
  * `iceberg.sparkCluster` (optional)

{% hint style="info" %}
For the full attribute reference, see [Lakehouse manifest configurations](/concepts/resources/lakehouse/configurations.md).
{% endhint %}

<details>

<summary>Sample Lakehouse Manifest</summary>

{% code title="lakehouse.yaml" %}

```yaml
name: s3pglh
version: v1alpha
type: lakehouse
description: Lakehouse on S3 storage
tags:
  - lakehouse
  - s3
spec:
  compute: ${compute-name}
  runAsUser: ${user-id}
  logLevel: INFO
  iceberg:
    metastore:
      type: iceberg-jdbc-catalog
      replicas: 1
      secret: ${tenant}:${metastore-secret-name}
      postgresql:
        host: ${postgres-host}
        port: "5432"
        database: ${database-name}
        schema: public
      hadoopConf:
        fs.s3a.connection.maximum: 1000
      resources:
        requests:
          cpu: "1"
          memory: 1000Mi
        limits:
          cpu: "1"
          memory: 2000Mi
    storage:
      type: s3
      s3:
        bucket: ${s3-bucket-name}
        relativePath: ${relative-path}
        scheme: s3a
        format: ICEBERG
      secret: ${tenant}:${storage-secret-name}
    sherpa:
      replicas: 1
      resources:
        requests:
          cpu: 200m
          memory: 512Mi
        limits:
          cpu: 500m
          memory: 1000Mi
    sparkCluster:
      server:
        requests:
          cpu: 200m
          memory: 512Mi
        limits:
          cpu: 500m
          memory: 1000Mi
      driver:
        coreLimit: "2048m"
        cores: 1
        memory: "2048m"
      executor:
        coreLimit: "2048m"
        cores: 1
        memory: "2048m"
        instances: 1
```

{% endcode %}

</details>

**Resource metadata**

{% tabs %}
{% tab title="Syntax" %}

```yaml
name: ${resource-name}
version: v1alpha
type: lakehouse
description: ${description}
tags:
  - ${tag1}
  - ${tag2}
```

{% endtab %}

{% tab title="Example" %}

```yaml
name: s3pglh
version: v1alpha
type: lakehouse
description: S3-backed Lakehouse for production analytics
tags:
  - lakehouse
  - s3
```

{% endtab %}
{% endtabs %}

**`spec`**

{% tabs %}
{% tab title="Syntax" %}

```yaml
spec:
  compute: ${compute-resource-name}
  runAsUser: ${user-id}
  logLevel: ${log-level}
  iceberg:
    metastore:
      # metastore configuration
    storage:
      # storage configuration
    sherpa:
      # sherpa configuration (optional)
    sparkCluster:
      # spark cluster configuration (optional)
```

{% endtab %}

{% tab title="Example" %}

```yaml
spec:
  compute: ironstorm-compute
  runAsUser: iamgroot
  logLevel: INFO
  iceberg:
    metastore:
      type: iceberg-jdbc-catalog
    storage:
      type: s3
    sherpa:
      replicas: 1
    sparkCluster:
      driver:
        cores: 1
        memory: "2048m"
      executor:
        instances: 1
```

{% endtab %}
{% endtabs %}

**Metastore**

{% tabs %}
{% tab title="Syntax" %}

```yaml
metastore:
  type: iceberg-jdbc-catalog
  replicas: ${number-of-replicas}
  secret: ${tenant}:${metastore-secret-name}
  postgresql:
    host: ${postgres-host}
    port: ${port}
    database: ${database-name}
    schema: ${schema-name}
  hadoopConf:
    ${hadoop-configuration-key}: ${value}
  resources:
    requests:
      cpu: ${cpu-request}
      memory: ${memory-request}
    limits:
      cpu: ${cpu-limit}
      memory: ${memory-limit}
```

{% endtab %}

{% tab title="Example" %}

```yaml
metastore:
  type: iceberg-jdbc-catalog
  replicas: 1
  secret: engineering:lhpgsecret
  postgresql:
    host: modern-postgresql-server.postgres.database.azure.com
    port: "5432"
    database: db_englh_s3
    schema: public
  hadoopConf:
    fs.s3a.connection.maximum: 1000
  resources:
    requests:
      cpu: "1"
      memory: 1000Mi
    limits:
      cpu: "1"
      memory: 2000Mi
```

{% endtab %}
{% endtabs %}

**Storage**

{% tabs %}
{% tab title="S3" %}

```yaml
storage:
  type: s3
  s3:
    bucket: ${s3-bucket-name}
    relativePath: ${relative-path}
    scheme: s3a
    format: ICEBERG
  secret: ${tenant}:${storage-secret-name}
```

{% endtab %}

{% tab title="ABFSS" %}

```yaml
storage:
  type: abfss
  abfss:
    account: ${abfss-account}
    container: ${container-name}
    relativePath: ${relative-path}
    endpointSuffix: dfs.core.windows.net
    format: ICEBERG
  secret: ${tenant}:${storage-secret-name}
```

{% endtab %}

{% tab title="GCS" %}

```yaml
storage:
  type: gcs
  gcs:
    bucket: ${gcs-bucket-name}
    relativePath: ${relative-path}
    format: ICEBERG
  secret: ${tenant}:${storage-secret-name}
```

{% endtab %}
{% endtabs %}

**Sherpa**

```yaml
sherpa:
  replicas: ${number-of-replicas}
  resources:
    requests:
      cpu: ${cpu-request}
      memory: ${memory-request}
    limits:
      cpu: ${cpu-limit}
      memory: ${memory-limit}
```

**Spark Cluster**

```yaml
sparkCluster:
  server:
    requests:
      cpu: ${cpu-request}
      memory: ${memory-request}
    limits:
      cpu: ${cpu-limit}
      memory: ${memory-limit}
  driver:
    coreLimit: ${core-limit}
    cores: ${number-of-cores}
    coreRequest: ${cpu-request}
    memory: ${memory}
  executor:
    coreLimit: ${core-limit}
    cores: ${number-of-cores}
    coreRequest: ${cpu-request}
    memory: ${memory}
    instances: ${number-of-instances}
```

`coreRequest` accepts Kubernetes-style fractional CPU values (for example, `500m`). `cores` accepts only whole numbers, so use `coreRequest` when you need fine-grained CPU scheduling.

{% endstep %}

{% step %}

#### Apply the Lakehouse manifest

Apply the manifest with the CLI:

```bash
dataos-ctl resource apply -f ${manifest-file-path}
```

Example:

```bash
dataos-ctl resource apply -f ./lakehouse/s3-lakehouse.yaml
```
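
Because the Secrets must exist before the Lakehouse manifest is applied, it can help to apply the files in a fixed order and stop at the first failure. The function below is a minimal sketch of that idea. `apply_in_order` and the `DATAOS_CTL` override variable are hypothetical names introduced here; only `dataos-ctl resource apply -f` comes from this page.

```bash
# Sketch: apply manifests in the order given and stop at the first failure.
# DATAOS_CTL is a hypothetical override, useful for dry runs (DATAOS_CTL=echo).
apply_in_order() {
  ctl=${DATAOS_CTL:-dataos-ctl}
  for manifest in "$@"; do
    echo "Applying $manifest"
    "$ctl" resource apply -f "$manifest" || {
      echo "Apply failed for $manifest; stopping." >&2
      return 1
    }
  done
}
```

A typical call would list the Secret manifests first and the Lakehouse manifest last, for example `apply_in_order metastore-secret.yaml storage-secret.yaml lakehouse.yaml` (placeholder file names).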

{% endstep %}

{% step %}

#### Verify and inspect the Lakehouse

Check Lakehouse status:

```bash
dataos-ctl resource get -t lakehouse
```

List Lakehouses created by all users:

```bash
dataos-ctl resource get -t lakehouse -a
```

Inspect a specific Lakehouse:

```bash
dataos-ctl resource get -t lakehouse -n ${lakehouse-name} -d
```

Get build details:

```bash
dataos-ctl resource get -t lakehouse -n ${lakehouse-name} -b
```

Useful inspection flags:

* `-d` shows the submitted spec, runtime state, status, and active properties
* `-b` shows the Kubernetes resources created during the build
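
If you script the rollout, the status check can be polled until the Lakehouse reports a ready state. The helper below is a generic sketch: `wait_for_status` is a hypothetical name, and the text to match is an assumption, since this page does not show the exact CLI output. Check what your `dataos-ctl resource get` output looks like before choosing a pattern.

```bash
# Sketch: poll a status command until its output contains a pattern.
# The pattern to look for is an assumption; confirm it against real output.
wait_for_status() {
  pattern=$1      # text expected in the output once the resource is ready
  max_tries=$2    # number of polls before giving up
  delay=$3        # seconds to sleep between polls
  shift 3         # remaining arguments are the status command itself
  try=1
  while :; do
    if "$@" | grep -q -- "$pattern"; then
      return 0
    fi
    [ "$try" -ge "$max_tries" ] && return 1
    try=$((try + 1))
    sleep "$delay"
  done
}
```

For example, `wait_for_status active 30 10 dataos-ctl resource get -t lakehouse -n ${lakehouse-name}` would poll up to 30 times, 10 seconds apart, assuming the ready state appears as `active` in the output.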

Connectivity options:

* Use `tcp stream` to reach services inside the cluster
* Port-forward to the Spark Cluster for direct query access

Delete a Lakehouse with any one of the following commands:

```bash
dataos-ctl resource delete -i "${name} | ${version} | lakehouse"
```

```bash
dataos-ctl resource delete -f ${manifest-file-path}
```

```bash
dataos-ctl resource delete -t lakehouse -n ${lakehouse-name}
```
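
The `-i` form takes a single pipe-delimited identifier, so the whole string must be quoted. A tiny helper, hypothetical and shown only to make the format explicit, could build it from the resource name and version:

```bash
# Hypothetical helper: build the "name | version | lakehouse" identifier
# expected by `dataos-ctl resource delete -i`.
lakehouse_identifier() {
  printf '%s | %s | lakehouse\n' "$1" "$2"
}
```

With the sample manifest from this page, `dataos-ctl resource delete -i "$(lakehouse_identifier s3pglh v1alpha)"` passes `s3pglh | v1alpha | lakehouse` as one argument.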

{% endstep %}
{% endstepper %}

### Create a Lakehouse Depot

A Depot of type `lakehouse` connects to an existing Lakehouse instance, exposing its REST Catalog and object storage through the DataOS catalog layer. Create the Depot after the Lakehouse is running.

The `metastoreUrl` and `metastoreRelativePath` fields identify the REST Catalog service. Use the internal Kubernetes service address for `metastoreUrl`.

{% tabs %}
{% tab title="S3" %}

```yaml
name: ${depot-name}
version: v2alpha
type: depot
description: "Lakehouse depot: Iceberg REST catalog + S3 storage"
tags:
  - lakehouse
  - s3
depot:
  type: lakehouse
  description: "Iceberg data depot backed by ${lakehouse-name}"
  spec:
    storageType: s3
    catalogType: REST
    metastoreUrl: <kong-proxy-service-name>.<environment-namespace>.svc.cluster.local:80
    metastoreRelativePath: /<ingress-path>
    s3:
      bucket: ${s3-bucket-name}
      relativePath: ${relative-path}
      format: ICEBERG
      region: ${aws-region}
  secrets:
    - id: ${tenant}:${secret-name}ds
      purpose: rw
```

{% endtab %}

{% tab title="ABFSS" %}

```yaml
name: ${depot-name}
version: v2alpha
type: depot
description: "Lakehouse depot: Iceberg REST catalog + ABFSS storage"
tags:
  - lakehouse
  - abfss
depot:
  type: lakehouse
  description: "Iceberg data depot backed by ${lakehouse-name}"
  spec:
    storageType: abfss
    catalogType: REST
    metastoreUrl: <kong-proxy-service-name>.<environment-namespace>.svc.cluster.local:80
    metastoreRelativePath: /<ingress-path>
    abfss:
      account: ${abfss-account}
      container: ${container-name}
      endpointSuffix: ${endpoint-suffix}
      relativePath: ${relative-path}
      format: ICEBERG
  secrets:
    - id: ${tenant}:${secret-name}ds
      purpose: rw
```

{% endtab %}

{% tab title="GCS" %}

```yaml
name: ${depot-name}
version: v2alpha
type: depot
description: "Lakehouse depot: Iceberg REST catalog + GCS storage"
tags:
  - lakehouse
  - gcs
depot:
  type: lakehouse
  description: "Iceberg data depot backed by ${lakehouse-name}"
  spec:
    storageType: gcs
    catalogType: REST
    metastoreUrl: <kong-proxy-service-name>.<environment-namespace>.svc.cluster.local:80
    metastoreRelativePath: /<ingress-path>
    gcs:
      bucket: ${gcs-bucket-name}
      relativePath: ${relative-path}
      format: ICEBERG
  secrets:
    - id: ${tenant}:${secret-name}ds
      purpose: rw
```

{% endtab %}
{% endtabs %}

Apply the Depot manifest:

```bash
dataos-ctl resource apply -f ${depot-manifest-file-path}
```

Verify the Depot:

```bash
dataos-ctl resource get -t depot
```

### Supported data format

* Lakehouse uses Apache Iceberg as the table format
* Lakehouse stores data in Apache Parquet files
* The Spark Cluster can also handle standalone Parquet use cases outside Lakehouse

### Use a Lakehouse in DataOS

* Manage operations and datasets with the [Lakehouse command reference](/concepts/resources/lakehouse/command-reference.md)
* Review manifest fields in [Lakehouse manifest configurations](/concepts/resources/lakehouse/configurations.md)


