ADR-0009 — Opinionated stage architecture

Last updated: 2026-05-27 21:55

Status: accepted
Date: 2026-05-27

Context

Stages could be declared as generic resources with user-chosen names and manual configuration. Instead, the provider exposes a fixed set of semantic stage slots in manifest.yml. Each slot has a known role — the provider auto-configures grants, associated tasks, and behaviour based on the slot name, not on user configuration.

Decision

Four opinionated stage slots:

Slot	Role	Side effect
`external_input_files`	External input — cloud storage (S3/Azure Blob/GCS)	Auto-injects `POLL_AND_DISPATCH` task (schedule mode)
`external_output_files`	External output — cloud storage (S3/Azure Blob/GCS)	Provisions stage only
`internal_exposed_files`	Internal — files exposed to other teams	Provisions stage + `VIEWER` grants
`resources_static`	Internal — static resources used by the pipeline	Provisions stage only

stages:
  - external_input_files
  - external_output_files
  - internal_exposed_files
  - resources_static

Declaring external_input_files is sufficient: the provider knows what to provision and what to inject.
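The "name = contract" idea can be pictured as a lookup from slot name to side effects. The sketch below is only an illustration: `SlotPlan`, `SLOT_PLANS`, and `plan_stages` are hypothetical names, not the provider's actual internals, and the side effects are copied from the slot table above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SlotPlan:
    """What declaring a slot implies (hypothetical structure)."""
    provisions_stage: bool
    injected_tasks: tuple[str, ...] = ()
    grants: tuple[str, ...] = ()


# Side effects taken from the slot table; nothing here is user-configurable.
SLOT_PLANS = {
    "external_input_files": SlotPlan(True, injected_tasks=("POLL_AND_DISPATCH",)),
    "external_output_files": SlotPlan(True),
    "internal_exposed_files": SlotPlan(True, grants=("VIEWER",)),
    "resources_static": SlotPlan(True),
}


def plan_stages(declared: list[str]) -> dict[str, SlotPlan]:
    """Resolve declared slot names to their plans and reject unknown names."""
    unknown = [name for name in declared if name not in SLOT_PLANS]
    if unknown:
        raise ValueError(f"not an opinionated slot: {unknown}")
    return {name: SLOT_PLANS[name] for name in declared}
```

Rejecting unknown names in this sketch mirrors the split described in this ADR: anything outside the four slots belongs to the generic stage resource type rather than to the slot list.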

Bucket architecture convention

external_input_files and external_output_files imply an opinionated bucket structure:

# Input
external_input_files/{database_name}/{schema_name}/{provider_ref}/{ref_flux_provider}/
                                                                       ← files land here
# Stage URL (scoped at schema level)
external_input_files/{database_name}/{schema_name}/

# Output
external_output_files/{database_name}/{schema_name}/{consumer_ref}/{ref_flux_consumer}/
# Stage URL
external_output_files/{database_name}/{schema_name}/

The Snowflake stage is scoped at {database_name}/{schema_name}/ — one stage per schema, covering all providers/consumers and all fluxes. The provider/flux hierarchy exists in the bucket but is transparent to the stage.

Declaring the slot is enough: the provider derives the stage URL from {database_name}/{schema_name} in manifest.yml. No URL configuration needed.
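A minimal sketch of that derivation, assuming the stage location is simply the slot name followed by the database and schema names. The function name `stage_prefix` and the sample values are hypothetical; the bucket or container prefix (for example an `s3://` root) is left out because this ADR does not specify it.

```python
EXTERNAL_SLOTS = ("external_input_files", "external_output_files")


def stage_prefix(slot: str, database_name: str, schema_name: str) -> str:
    """Return the bucket-relative prefix the stage is scoped to (schema level)."""
    if slot not in EXTERNAL_SLOTS:
        raise ValueError(f"{slot} has no bucket convention")
    # One stage per schema: provider/consumer and flux folders sit below this prefix.
    return f"{slot}/{database_name}/{schema_name}/"


# Hypothetical database and schema names, for illustration only.
print(stage_prefix("external_input_files", "analytics_db", "sales"))
```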

Auto-provisioned registry tables

Each slot except resources_static provisions a registry table alongside the stage:

Slot	Registry	Extra columns	Purpose
`external_input_files`	✓	—	Tracks incoming files; stream feeds `POLL_AND_DISPATCH` / `CONSUME_STREAM`
`external_output_files`	✓	`hash_sorted`, `hash_raw`	Non-regression: compare output hashes against gold hashes per job version
`internal_exposed_files`	✓	`hash_sorted`, `hash_raw`	Same non-regression capability for exposed internal files
`resources_static`	—	—	Static resources — no tracking needed

Hash columns:
- hash_sorted: hash of the file content with rows sorted; detects content changes regardless of row order
- hash_raw: hash of the file as-is; detects exact byte-level changes, including row order
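The two hash columns can be sketched as follows. This is an assumption-laden illustration: the ADR names neither the hash algorithm nor the row delimiter, so SHA-256 and newline-separated rows are choices made here, and the function names simply reuse the column names.

```python
import hashlib


def hash_raw(content: bytes) -> str:
    """Hash of the file exactly as stored: sensitive to row order and every byte."""
    return hashlib.sha256(content).hexdigest()


def hash_sorted(content: bytes) -> str:
    """Hash of the rows after sorting: stable when only the row order changes."""
    rows = content.splitlines()  # assumption: one row per line
    return hashlib.sha256(b"\n".join(sorted(rows))).hexdigest()


same_rows_a = b"id,name\n2,bob\n1,alice\n"
same_rows_b = b"id,name\n1,alice\n2,bob\n"

# Same rows in a different order: the sorted hashes agree, the raw hashes differ.
print(hash_sorted(same_rows_a) == hash_sorted(same_rows_b))
print(hash_raw(same_rows_a) == hash_raw(same_rows_b))
```

Note that in this sketch a header row is sorted along with the data rows, which is harmless for comparison purposes as long as both sides are hashed the same way.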

Non-regression pattern: replay in SANDBOX an input file from a prod run that is known to be good. Compare the output hash in SANDBOX against the hash stored in the prod registry for that job version. Match = no regression. No separate test harness, no synthetic test data: real prod inputs, real hashes.

Full CI vision (not yet implemented):

A non-prod-only table per schema stores the gold dataset:

sp_version | input_file_name | outputs: list[{output_file_name, hash_raw, hash_sorted}]

CI pipeline:
1. Fetch latest gold version from the table
2. Upload reference input file to external_input_files, which triggers the DAG automatically
3. Compare output hashes against gold
4. Pass / fail

The infrastructure (stage + stream + registry) is already in place — the CI layer is just the orchestration on top.
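Since the CI layer is not yet implemented, the following is only a sketch of what the comparison step could look like. `OutputHash` mirrors the `outputs` entries of the gold table; `find_regressions` and the rule that a raw-only difference also fails are assumptions, not decisions recorded in this ADR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class OutputHash:
    output_file_name: str
    hash_raw: str
    hash_sorted: str


def find_regressions(gold: list[OutputHash], actual: list[OutputHash]) -> list[str]:
    """Compare a run's output hashes against gold; an empty list means pass."""
    actual_by_name = {o.output_file_name: o for o in actual}
    failures = []
    for g in gold:
        a = actual_by_name.get(g.output_file_name)
        if a is None:
            failures.append(f"{g.output_file_name}: missing output")
        elif a.hash_sorted != g.hash_sorted:
            failures.append(f"{g.output_file_name}: content changed")
        elif a.hash_raw != g.hash_raw:
            # Assumption: a pure row-order or byte-level change also fails.
            failures.append(f"{g.output_file_name}: row order or bytes changed")
    gold_names = {g.output_file_name for g in gold}
    for name in sorted(set(actual_by_name) - gold_names):
        failures.append(f"{name}: unexpected output")
    return failures
```

Distinguishing the two failure kinds is the reason both hashes are stored: a mismatch on `hash_sorted` means the data changed, while a mismatch only on `hash_raw` points at ordering or serialization.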

Why `{provider_ref}/{ref_flux_provider}/`

MFT tools generate opaque flux names (wkd_etl_01, etl_adp_06). Without a convention, files land as file_01, file_02 with no semantic meaning. Embedding {provider_ref}/{ref_flux_provider}/ in the path makes the S3 structure itself the mapping (S3 path → MFT interface → file per flux), so there is no separate mapping table to maintain.

Example: external_output_files/data_product__pluto_match/client_a/hubspot/export_leads/ is self-documenting. pinkysight reads the path hierarchy and can serve dimensional views out of the box: all incoming files by provider, all outgoing files by consumer, volume by flux — no additional metadata, no mapping table. The path structure IS the data model for file flow monitoring.
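To illustrate how the path alone can feed dimensional views, here is a hedged sketch that splits a path into the convention's segments and counts files per flux. `FilePath`, `parse_file_path`, `volume_by_flux`, and the file names are hypothetical; this is not how pinkysight is implemented, only one way the convention could be read.

```python
from __future__ import annotations

from collections import Counter
from typing import NamedTuple


class FilePath(NamedTuple):
    slot: str
    database_name: str
    schema_name: str
    party_ref: str  # provider_ref for inputs, consumer_ref for outputs
    flux_ref: str   # ref_flux_provider or ref_flux_consumer
    file_name: str


def parse_file_path(path: str) -> FilePath:
    """Split slot/database/schema/party/flux/file into named dimensions."""
    parts = path.strip("/").split("/")
    if len(parts) < 6:
        raise ValueError(f"path does not follow the convention: {path}")
    slot, database, schema, party, flux = parts[:5]
    return FilePath(slot, database, schema, party, flux, "/".join(parts[5:]))


def volume_by_flux(paths: list[str]) -> Counter:
    """Count files per (party, flux) using nothing but the path hierarchy."""
    parsed = (parse_file_path(p) for p in paths)
    return Counter((p.party_ref, p.flux_ref) for p in parsed)


base = "external_output_files/data_product__pluto_match/client_a/hubspot/export_leads/"
print(volume_by_flux([base + "leads_01.csv", base + "leads_02.csv"]))
```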

POLL_AND_DISPATCH polls external_input_files/{database_name}/{schema_name}/ and builds two lists passed to the business SP:
- new_files: files not yet processed
- existing_files: files already processed

The business SP receives these lists and is path-agnostic — it never needs to know about {provider_ref}/{ref_flux_provider}/. The path complexity is fully absorbed by the infrastructure layer.
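The split performed before the business SP is called can be sketched as below. This assumes the set of already-processed paths is available from the registry; `split_files` and the sample paths are hypothetical, and the real task runs inside Snowflake rather than in plain Python.

```python
from __future__ import annotations


def split_files(listed: list[str], processed: set[str]) -> tuple[list[str], list[str]]:
    """Partition the files found under the stage into (new_files, existing_files)."""
    new_files = [path for path in listed if path not in processed]
    existing_files = [path for path in listed if path in processed]
    return new_files, existing_files


# Paths are relative to the stage; the provider/flux folders are plain strings here.
listed = ["acme/orders/2026-05-26.csv", "acme/orders/2026-05-27.csv"]
processed = {"acme/orders/2026-05-26.csv"}

new_files, existing_files = split_files(listed, processed)
print(new_files)
print(existing_files)
```

The business SP in this picture only ever sees the two lists, which is what keeps it independent of the folder hierarchy.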

Consequences

Zero configuration per stage: name = contract.
POLL_AND_DISPATCH injection is automatic when external_input_files is declared — no separate task declaration needed.
On high-volume file ingestion pipelines, the convention eliminates all naming debates: where to land files, where to put debug outputs, where processed copies go. The structure answers these questions before they're asked.
The predictable path structure enables pinkysight to monitor file flows by path automatically — no additional configuration needed because the convention is known at the suite level.
Users who need a stage outside these four slots use the generic stage resource type.
The opinionated slots encode a proven pipeline topology: cloud storage ingestion → extraction → internal data. Teams with a different topology use generic stages — the slots don't block them, they just don't help them.