ADR-0009 — Opinionated stage architecture

Last updated: 2026-05-27 21:55

Status: accepted
Date: 2026-05-27

Context

Stages could be declared as generic resources with user-chosen names and manual configuration. Instead, the provider exposes a fixed set of semantic stage slots in manifest.yml. Each slot has a known role — the provider auto-configures grants, associated tasks, and behaviour based on the slot name, not on user configuration.

Decision

Four opinionated stage slots:

| Slot | Role | Side effect |
|---|---|---|
| external_input_files | External input: cloud storage (S3/Azure Blob/GCS) | Auto-injects POLL_AND_DISPATCH task (schedule mode) |
| external_output_files | External output: cloud storage (S3/Azure Blob/GCS) | Provisions stage only |
| internal_exposed_files | Internal: files exposed to other teams | Provisions stage + VIEWER grants |
| resources_static | Internal: static resources used by the pipeline | Provisions stage only |

stages:
  - external_input_files
  - external_output_files
  - internal_exposed_files
  - resources_static

Declaring external_input_files under stages is sufficient: the provider knows what to provision (the stage) and what to inject (the POLL_AND_DISPATCH task).
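To make "name = contract" concrete, here is a minimal Python sketch of how a slot name could resolve to its side effects. The `SlotSpec` class, the `SLOTS` table and the `plan` function are hypothetical names introduced for illustration; only the slot names, the POLL_AND_DISPATCH task and the VIEWER grants come from the table above. The provider's real implementation may look quite different.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotSpec:
    """What a slot name implies (hypothetical structure)."""
    role: str
    injected_tasks: tuple[str, ...] = ()
    grants: tuple[str, ...] = ()


# Mirrors the slot table in this ADR; the dict itself is an assumption.
SLOTS = {
    "external_input_files": SlotSpec(
        "external input", injected_tasks=("POLL_AND_DISPATCH",)
    ),
    "external_output_files": SlotSpec("external output"),
    "internal_exposed_files": SlotSpec(
        "internal, exposed to other teams", grants=("VIEWER",)
    ),
    "resources_static": SlotSpec("internal, static resources"),
}


def plan(stages: list[str]) -> dict[str, SlotSpec]:
    """Resolve the `stages` list of a manifest into per-slot specs."""
    unknown = [name for name in stages if name not in SLOTS]
    if unknown:
        # Anything outside the four slots belongs to the generic stage resource.
        raise ValueError(f"not an opinionated slot: {unknown}")
    return {name: SLOTS[name] for name in stages}


resolved = plan(["external_input_files", "internal_exposed_files"])
print(resolved["external_input_files"].injected_tasks)
```

The point of the sketch is that the manifest carries no per-stage settings: everything is looked up from the slot name.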

Bucket architecture convention

external_input_files and external_output_files imply an opinionated bucket structure:

# Input
external_input_files/{database_name}/{schema_name}/{provider_ref}/{ref_flux_provider}/
                                                                       ← files land here
# Stage URL (scoped at schema level)
external_input_files/{database_name}/{schema_name}/

# Output
external_output_files/{database_name}/{schema_name}/{consumer_ref}/{ref_flux_consumer}/
# Stage URL
external_output_files/{database_name}/{schema_name}/

The Snowflake stage is scoped at {database_name}/{schema_name}/ — one stage per schema, covering all providers/consumers and all fluxes. The provider/flux hierarchy exists in the bucket but is transparent to the stage.

Declaring the slot is enough: the provider derives the stage URL from {database_name}/{schema_name} in manifest.yml. No URL configuration needed.
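The derivation described above can be sketched in a few lines of Python. The function names `stage_url` and `landing_prefix` and the parameter names `party_ref` and `flux_ref` (standing in for `{provider_ref}`/`{consumer_ref}` and `{ref_flux_provider}`/`{ref_flux_consumer}`) are assumptions made for this example, and the paths are kept relative to the bucket, as in the layout shown earlier.

```python
EXTERNAL_SLOTS = ("external_input_files", "external_output_files")


def stage_url(slot: str, database_name: str, schema_name: str) -> str:
    """Stage URL, scoped at schema level: one stage per schema."""
    if slot not in EXTERNAL_SLOTS:
        raise ValueError(f"bucket convention only applies to {EXTERNAL_SLOTS}")
    return f"{slot}/{database_name}/{schema_name}/"


def landing_prefix(
    slot: str, database_name: str, schema_name: str, party_ref: str, flux_ref: str
) -> str:
    """Prefix where files for one provider/consumer and one flux land."""
    return f"{stage_url(slot, database_name, schema_name)}{party_ref}/{flux_ref}/"


url = stage_url("external_output_files", "data_product__pluto_match", "client_a")
prefix = landing_prefix(
    "external_output_files", "data_product__pluto_match", "client_a",
    "hubspot", "export_leads",
)
print(url)
print(prefix)
```

Every landing prefix starts with the stage URL, which is why a single stage per schema can see all providers, consumers and fluxes.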

Auto-provisioned registry tables

Each slot provisions a registry table alongside the stage:

| Slot | Registry | Extra columns | Purpose |
|---|---|---|---|
| external_input_files | | | Tracks incoming files; stream feeds POLL_AND_DISPATCH / CONSUME_STREAM |
| external_output_files | | hash_sorted, hash_raw | Non-regression: compare output hashes against gold hashes per job version |
| internal_exposed_files | | hash_sorted, hash_raw | Same non-regression capability for exposed internal files |
| resources_static | | | Static resources; no tracking needed |

Hash columns:

  • hash_sorted: hash of file content with rows sorted → detects content changes regardless of row order
  • hash_raw: hash of file as-is → detects exact byte-level changes including row order
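The difference between the two hashes is easiest to see in code. This is a minimal sketch under stated assumptions: the ADR does not name a hash algorithm or a file format, so SHA-256 and newline-delimited rows are choices made here for illustration, and the header line (if any) is sorted along with the data rows.

```python
import hashlib


def hash_raw(data: bytes) -> str:
    """Hash of the file exactly as stored: sensitive to row order."""
    return hashlib.sha256(data).hexdigest()


def hash_sorted(data: bytes) -> str:
    """Hash of the rows after sorting: insensitive to row order.

    Assumes one row per line; SHA-256 is an assumption, not the ADR's choice.
    """
    rows = sorted(data.splitlines())
    return hashlib.sha256(b"\n".join(rows)).hexdigest()


run_a = b"id,name\n2,bob\n1,alice\n"
run_b = b"id,name\n1,alice\n2,bob\n"  # same rows, different order

print(hash_raw(run_a) == hash_raw(run_b))        # False: bytes differ
print(hash_sorted(run_a) == hash_sorted(run_b))  # True: same content
```

A mismatch on hash_raw with a match on hash_sorted therefore points at a reordering rather than a change in content.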

Non-regression pattern: take the input file of a known-good prod run and feed it to SANDBOX. Compare the output hash produced in SANDBOX against the hash stored in the prod registry for that job version. A match means no regression. No separate test harness, no synthetic test data: real prod inputs, real hashes.

Full CI vision (not yet implemented):

A non-prod-only table per schema stores the gold dataset:

sp_version | input_file_name | outputs: list[{output_file_name, hash_raw, hash_sorted}]

CI pipeline:

  1. Fetch latest gold version from the table
  2. Upload reference input file to external_input_files → triggers the DAG automatically
  3. Compare output hashes against gold
  4. Pass / fail

The infrastructure (stage + stream + registry) is already in place — the CI layer is just the orchestration on top.
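Step 3 of the CI pipeline (comparing output hashes against gold) could look roughly like the Python sketch below. `OutputHash` mirrors one entry of the `outputs` list in the gold table; `compare_to_gold` is a hypothetical helper. Treating a hash_raw-only mismatch and an unexpected extra output as failures are policy choices made for this example, not something the ADR specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputHash:
    output_file_name: str
    hash_raw: str
    hash_sorted: str


def compare_to_gold(gold: list[OutputHash], actual: list[OutputHash]) -> list[str]:
    """Return failure messages; an empty list means the run passes."""
    actual_by_name = {o.output_file_name: o for o in actual}
    gold_names = {g.output_file_name for g in gold}
    failures = []
    for g in gold:
        a = actual_by_name.get(g.output_file_name)
        if a is None:
            failures.append(f"{g.output_file_name}: missing output")
        elif a.hash_sorted != g.hash_sorted:
            failures.append(f"{g.output_file_name}: content changed")
        elif a.hash_raw != g.hash_raw:
            failures.append(f"{g.output_file_name}: row order or bytes changed")
    for a in actual:
        if a.output_file_name not in gold_names:
            failures.append(f"{a.output_file_name}: unexpected output")
    return failures


gold = [OutputHash("leads.csv", "raw-1", "sorted-1")]
actual = [OutputHash("leads.csv", "raw-2", "sorted-1")]
print(compare_to_gold(gold, actual) or "pass")
```

Checking hash_sorted before hash_raw lets the failure message distinguish a real content change from a mere reordering.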

Why {provider_ref}/{ref_flux_provider}/

MFT tools generate opaque flux names (wkd_etl_01, etl_adp_06). Without a convention, files land as file_01, file_02 with no semantic meaning. Embedding {provider_ref}/{ref_flux_provider}/ in the path makes the S3 structure itself the mapping (S3 path → MFT interface → file per flux), so there is no separate mapping table to maintain.

Example: external_output_files/data_product__pluto_match/client_a/hubspot/export_leads/ is self-documenting. pinkysight reads the path hierarchy and can serve dimensional views out of the box: all incoming files by provider, all outgoing files by consumer, volume by flux — no additional metadata, no mapping table. The path structure IS the data model for file flow monitoring.
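As an illustration of "the path structure is the data model", the Python sketch below splits a file path into its dimensions and counts files along any one of them. It is not pinkysight's code: `FilePath`, `parse_path` and `volume_by` are hypothetical names, `party_ref` stands for either `{provider_ref}` or `{consumer_ref}`, the file names are made up, and the sketch assumes files sit directly under the flux folder.

```python
from collections import Counter
from typing import NamedTuple


class FilePath(NamedTuple):
    slot: str
    database_name: str
    schema_name: str
    party_ref: str  # provider_ref for inputs, consumer_ref for outputs
    flux_ref: str
    file_name: str


def parse_path(path: str) -> FilePath:
    """Split a bucket-relative file path into the convention's dimensions."""
    parts = path.strip("/").split("/")
    if len(parts) != 6:
        raise ValueError(f"path does not follow the convention: {path!r}")
    return FilePath(*parts)


def volume_by(paths: list[str], dimension: str) -> Counter:
    """Count files per value of one dimension, e.g. 'party_ref' or 'flux_ref'."""
    return Counter(getattr(parse_path(p), dimension) for p in paths)


base = "external_output_files/data_product__pluto_match/client_a"
paths = [
    f"{base}/hubspot/export_leads/leads_1.csv",  # hypothetical file names
    f"{base}/hubspot/export_leads/leads_2.csv",
    f"{base}/hubspot/export_contacts/contacts_1.csv",
]
print(volume_by(paths, "flux_ref"))
```

No lookup table is consulted anywhere: each dimension comes from the position of a path segment.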

POLL_AND_DISPATCH polls external_input_files/{database_name}/{schema_name}/ and builds two lists passed to the business SP:

  • new_files: files not yet processed
  • existing_files: files already processed

The business SP receives these lists and is path-agnostic — it never needs to know about {provider_ref}/{ref_flux_provider}/. The path complexity is fully absorbed by the infrastructure layer.
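The split into the two lists can be sketched as follows. This is a simplification under stated assumptions: the real task presumably relies on the stage, stream and registry table, whereas here `listed` is a plain list of file paths and `processed` is a set of paths already recorded, both hypothetical inputs. `business_sp` is a stand-in for the business stored procedure.

```python
def partition_files(
    listed: list[str], processed: set[str]
) -> tuple[list[str], list[str]]:
    """Split the files seen under the stage into (new_files, existing_files)."""
    new_files = [f for f in listed if f not in processed]
    existing_files = [f for f in listed if f in processed]
    return new_files, existing_files


def business_sp(new_files: list[str], existing_files: list[str]) -> int:
    """Stand-in business logic: it only sees two lists, never the path layout."""
    return len(new_files)


listed = [
    "provider_a/flux_1/file_01.csv",  # hypothetical paths below the stage URL
    "provider_a/flux_1/file_02.csv",
    "provider_b/flux_9/file_01.csv",
]
processed = {"provider_a/flux_1/file_01.csv"}

new_files, existing_files = partition_files(listed, processed)
print(business_sp(new_files, existing_files))
```

The business function receives opaque strings, so changing the folder hierarchy would not require touching it.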

Consequences

  • Zero configuration per stage: name = contract.
  • POLL_AND_DISPATCH injection is automatic when external_input_files is declared — no separate task declaration needed.
  • On high-volume file ingestion pipelines, the convention eliminates all naming debates: where to land files, where to put debug outputs, where processed copies go. The structure answers these questions before they're asked.
  • The predictable path structure enables pinkysight to monitor file flows by path automatically — no additional configuration needed because the convention is known at the suite level.
  • Users who need a stage outside these four slots use the generic stage resource type.
  • The opinionated slots encode a proven pipeline topology: cloud storage ingestion → extraction → internal data. Teams with a different topology use generic stages — the slots don't block them, they just don't help them.