> For the complete documentation index, see [llms.txt](https://v2.dataos.info/llms.txt). Markdown versions of documentation pages are available by appending `.md` to page URLs; this page is available as [Markdown](https://v2.dataos.info/concepts/resources/nilus/batch/batch-sources/google-bigquery.md).

# Google BigQuery

[Google BigQuery](https://cloud.google.com/bigquery/docs) is Google Cloud's serverless data warehouse. Nilus reads from BigQuery as a **batch source** through the standard BigQuery SQL path, with optional incremental ingestion driven by a monotonic key column.

For information on writing **into** BigQuery, see the [BigQuery](/concepts/resources/nilus/destinations/cloud-warehouses/bigquery.md).

## Requirements

Connectivity and credentials must both be in place before the pipeline can run.

### Connectivity

* The Nilus runtime must reach `bigquery.googleapis.com` over HTTPS.
* Credentials are resolved from one of three sources, in priority order:
  1. `credentials_path` query parameter on the URI (path to a service-account JSON file inside the runtime).
  2. `credentials_base64` query parameter (base64-encoded service-account JSON).
  3. Google Application Default Credentials (ADC): workload identity, GKE metadata server, or a `GOOGLE_APPLICATION_CREDENTIALS` env var.

### Required parameters

| Parameter            | Required    | Default | Description                                                                                                               |
| -------------------- | ----------- | ------- | ------------------------------------------------------------------------------------------------------------------------- |
| `project`            | Yes         | -       | Google Cloud project ID that owns the dataset being read. Embedded as the URI host (`bigquery://<project>`).              |
| `credentials_path`   | Conditional | -       | Path to a service-account JSON file inside the runtime. Provide this **or** `credentials_base64` **or** rely on ADC.      |
| `credentials_base64` | Conditional | -       | Base64-encoded service-account JSON, supplied as a URI query parameter when mounting a credentials file is not practical. |
| `location`           | No          | `US`    | BigQuery region for jobs (e.g. `US`, `EU`, `asia-southeast1`). Must match the dataset's location.                         |

### Service-account permissions

The service account used by the connector needs:

* `roles/bigquery.dataViewer` on the source dataset (or specific tables): read access.
* `roles/bigquery.jobUser` on the project, permission to run query jobs.

For very large extracts where BigQuery streams query results through GCS, also grant:

* `roles/storage.objectViewer` on the staging bucket the source query export targets.

### URI format

```
bigquery://<project>?credentials_path=/path/to/sa.json&location=US
```

Base64-encoded credentials variant (useful when secrets are managed via env vars):

```
bigquery://<project>?credentials_base64=<base64-of-service-account-json>&location=US
```

ADC variant (no credentials in the URI):

```
bigquery://<project>?location=US
```

Depot-backed equivalent (recommended for production):

```
dataos://<bigquery-depot>?purpose=rw
```

### Depot-backed setup

For production, define a BigQuery **depot** plus a **secret** that holds the GCP service-account JSON. Nilus projects the key into the runtime automatically, so you don't pass `credentials_path` / `credentials_base64` in the address.

**Secret:** the service-account JSON under `data.gcp_json_key`:

```yaml
name: nilusbqsecret
version: v2alpha
type: secret
description: GCP service-account key for the BigQuery depot
layer: user
secret:
  type: key-value
  data:
    gcp_json_key: |
      {
        "type": "service_account",
        "project_id": "<gcp-project>",
        "private_key_id": "<key-id>",
        "private_key": "-----BEGIN PRIVATE KEY-----\n<...>\n-----END PRIVATE KEY-----\n",
        "client_email": "<sa-name>@<gcp-project>.iam.gserviceaccount.com",
        "client_id": "<client-id>",
        "token_uri": "https://oauth2.googleapis.com/token"
      }
```

**Depot:** references the secret per purpose (`scan` for metadata, `rw`/`query` for read and write):

```yaml
name: nilusbqdepot
version: v2alpha
type: depot
tags:
  - bigquery
layer: user
description: "BigQuery data-warehouse depot"
spec:
  type: bigquery
  spec:
    project: "<gcp-project>"
  secrets:
    - id: "<workspace>:nilusbqsecret"
      purpose: scan
    - id: "<workspace>:nilusbqsecret"
      purpose: rw
    - id: "<workspace>:nilusbqsecret"
      purpose: query
```

The pipeline then references only the depot, e.g. `address: dataos://nilusbqdepot?purpose=rw`.

## Source options

| Option                 | Required | Description                                                                                                                                                                       |
| ---------------------- | -------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `source_table`         | Yes      | Table reference in `<dataset>.<table>` form. Cross-project reads use `<project>.<dataset>.<table>`. To extract from a hand-authored SQL query, use `source_table: "query:<SQL>"`. |
| `incremental_key`      | No       | Timestamp or numeric column used to identify newly visible rows for each run. BigQuery `TIMESTAMP` / `DATETIME` columns or sequence-backed `id` are common picks.                 |
| `interval_start`       | No       | Optional ISO-8601 lower bound for the extraction window.                                                                                                                          |
| `interval_end`         | No       | Optional ISO-8601 upper bound for the extraction window.                                                                                                                          |
| `page_size`            | No       | Rows per extraction batch (default `50000`).                                                                                                                                      |
| `sql_reflection_level` | No       | `full` (default) or reduced, controls how thoroughly Nilus reflects the source schema before extraction.                                                                          |
| `sql_limit`            | No       | Caps total rows extracted per run. Use for sampling and validation.                                                                                                               |
| `sql_exclude_columns`  | No       | Comma-separated column names to skip during extraction.                                                                                                                           |
| `type_hints`           | No       | Object map of `column_name: <type>` to override inferred types. Supported types: `text`, `bigint`, `bool`, `timestamp`, `date`, `decimal`, `double`, `binary`, `json`, `time`.    |
| `max_table_nesting`    | No       | String. `"0"` for fully flattened analytics-friendly output. Raise to `"1"` to preserve a single level of `STRUCT` nesting; `"2"` for two levels.                                 |

> **Note** There is **no** separate `staging_bucket` option on the source side. `staging_bucket` is a sink-side option used by warehouse destinations that load via external stage; on the source side, BigQuery handles result export internally.

## Sample Nilus configs

Each example below is self-contained and uses the current Nilus pipeline shape.

### Batch, incremental ingestion to Lakehouse

```yaml
name: nilus-bigquery-batch
version: v1alpha
type: nilus
description: BigQuery → DataOS Lakehouse incremental snapshot
spec:
  type: batch
  compute: runnable-default
  source:
    address: dataos://bigquery-depot
    options:
      source_table: analytics.orders
      incremental_key: updated_at
  sink:
    address: dataos://analytics-lakehouse
    options:
      dest_table: sales.orders_raw
      incremental_strategy: merge
      aws_region: us-west-2
```

### Batch, query-driven extract with `STRUCT` flattening

```yaml
spec:
  type: batch
  source:
    address: bigquery://my-project?credentials_base64={BIGQUERY_SA_B64}&location=US
    options:
      source_table: "query:SELECT id, amount, customer.country AS country FROM analytics.orders WHERE event_date >= CURRENT_DATE - 30"
      max_table_nesting: "0"
  sink:
    address: dataos://analytics-lakehouse
    options:
      dest_table: sales.orders_recent
      incremental_strategy: replace
```

## Behavior and capabilities

* **Compute model**: Nilus connects through the SQLAlchemy BigQuery dialect and submits parameterized SQL reads. Each read becomes a BigQuery query job.
* **Object model**: BigQuery datasets and tables (or views) addressed as `<dataset>.<table>`, optionally `<project>.<dataset>.<table>` for cross-project reads. The SQLAlchemy dialect treats `<dataset>` like a schema.
* **Pipeline mode**: `batch` only on this page.
* **Authentication priority**: service-account JSON wins over ADC. If both `credentials_path` and `credentials_base64` are unset, the dialect falls back to Application Default Credentials so workload-identity setups (GKE, Cloud Run) work without explicit secrets.
* **Custom queries**: `source_table: "query:SELECT ..."` is the right shape for any non-trivial extract (joins, partition-aware filters, struct flattening).
* **Cost control**: BigQuery bills per byte processed by query jobs. Effective tactics: ensure `incremental_key` is on a partition column so the engine prunes partitions; use the `query:` form to project only the columns you need; set `sql_limit` for validation runs to cap bytes.

## Troubleshooting

| Symptom                                                                  | Likely cause                                                                                                                                                    | Resolution                                                                                                                              |
| ------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- |
| `403 Access Denied: BigQuery BigQuery: Permission denied` (read)         | Service account lacks `roles/bigquery.dataViewer` on the dataset or table.                                                                                      | Grant `roles/bigquery.dataViewer` on the dataset, or scope the grant to specific tables.                                                |
| `403 … bigquery.jobs.create`                                             | Service account lacks `roles/bigquery.jobUser` on the project.                                                                                                  | Grant `roles/bigquery.jobUser` on the project that issues the query.                                                                    |
| `Not found: Dataset <project>:<dataset>`                                 | Dataset is in a different project than the URI's host, or the project ID is wrong.                                                                              | Use the cross-project form `source_table: "<project>.<dataset>.<table>"`, or set the URI host to the right project.                     |
| `400 Cannot read in location 'us-east1' …`                               | The dataset lives in a region different from the connector's `location`.                                                                                        | Set the URI's `location` query parameter to the dataset's location (e.g. `EU`, `asia-southeast1`).                                      |
| `DefaultCredentialsError: Could not automatically determine credentials` | Neither `credentials_path` / `credentials_base64` was set, and the runtime has no ADC available.                                                                | Either add credentials to the URI, mount a service-account file via `credentials_path`, or configure workload identity for the runtime. |
| Large query bills bytes way above expected size                          | The query reads partitions outside the desired range, typically because `incremental_key` is not on a partition column, or no `WHERE` clause prunes partitions. | Make sure `incremental_key` matches the table's partition column; add explicit partition filters in the `query:` form.                  |
| Run extracts the full table every time despite `incremental_key`         | The column is not monotonically increasing, or the pipeline state was reset between runs.                                                                       | Pick a column that is genuinely monotonic; consider `_PARTITIONTIME` for ingestion-time partitioned tables.                             |

## Related docs

* [BigQuery](/concepts/resources/nilus/destinations/cloud-warehouses/bigquery.md): companion destination connector.
* [Optimize Sink Datasets](/concepts/resources/nilus/pipeline-optimization/optimize-sink-datasets.md): guidance on `incremental_strategy`, `partition_by`, `cluster_by`, and other shape settings.


---

# Agent Instructions
This documentation is published with GitBook. GitBook is the documentation platform designed so that both humans and AI agents can read, navigate, and reason over technical content effectively. Learn more at gitbook.com.

## Querying This Documentation
If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter, and the optional `goal` query parameter:

```
GET https://v2.dataos.info/concepts/resources/nilus/batch/batch-sources/google-bigquery.md?ask=<question>&goal=<endgoal>
```

`ask` is the immediate question: it should be specific, self-contained, and written in natural language.
`goal` is optional and describes the broader end goal you are ultimately trying to accomplish on behalf of the user. GitBook uses it to tailor the answer towards what is most useful for that goal.

The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
