ADR-0009 — Opinionated stage architecture
Update date: 2026-05-27 21:55
Status: accepted
Date: 2026-05-27
Context
Stages could be declared as generic resources with user-chosen names and manual configuration.
Instead, the provider exposes a fixed set of semantic stage slots in manifest.yml.
Each slot has a known role — the provider auto-configures grants, associated tasks, and
behaviour based on the slot name, not on user configuration.
Decision
Four opinionated stage slots:
| Slot | Role | Side effect |
|---|---|---|
| external_input_files | External input: cloud storage (S3/Azure Blob/GCS) | Auto-injects POLL_AND_DISPATCH task (schedule mode) |
| external_output_files | External output: cloud storage (S3/Azure Blob/GCS) | Provisions stage only |
| internal_exposed_files | Internal: files exposed to other teams | Provisions stage + VIEWER grants |
| resources_static | Internal: static resources used by the pipeline | Provisions stage only |
stages:
- external_input_files
- external_output_files
- internal_exposed_files
- resources_static
Declaring [external_input_files] is sufficient — the provider knows what to provision and what to inject.
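The "name = contract" idea can be sketched as a lookup from slot name to side effects. This is an illustrative sketch only: `SlotSpec`, `SLOTS`, and `plan` are hypothetical names, and the action strings are placeholders, not the provider's real output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlotSpec:
    role: str
    injects_poll_and_dispatch: bool
    viewer_grants: bool


# Hypothetical lookup table mirroring the slot table in this ADR.
SLOTS = {
    "external_input_files": SlotSpec("external input", True, False),
    "external_output_files": SlotSpec("external output", False, False),
    "internal_exposed_files": SlotSpec("internal, exposed to other teams", False, True),
    "resources_static": SlotSpec("internal, static resources", False, False),
}


def plan(declared_stages):
    """Return the provisioning actions implied by the declared slot names."""
    actions = []
    for name in declared_stages:
        if name not in SLOTS:
            # Anything outside the four slots would go through the generic stage resource.
            raise ValueError(f"unknown stage slot: {name!r}")
        spec = SLOTS[name]
        actions.append(f"provision stage {name}")
        if spec.viewer_grants:
            actions.append(f"grant VIEWER on {name}")
        if spec.injects_poll_and_dispatch:
            actions.append("inject task POLL_AND_DISPATCH")
    return actions
```

Calling `plan(["external_input_files"])` yields both the stage and the injected task, without the user declaring either explicitly.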
Bucket architecture convention
external_input_files and external_output_files imply an opinionated bucket structure:
# Input
external_input_files/{database_name}/{schema_name}/{provider_ref}/{ref_flux_provider}/   ← files land here
# Stage URL (scoped at schema level)
external_input_files/{database_name}/{schema_name}/
# Output
external_output_files/{database_name}/{schema_name}/{consumer_ref}/{ref_flux_consumer}/
# Stage URL
external_output_files/{database_name}/{schema_name}/
The Snowflake stage is scoped at {database_name}/{schema_name}/ — one stage per schema, covering all
providers/consumers and all fluxes. The provider/flux hierarchy exists in the bucket
but is transparent to the stage.
Declaring the slot is enough: the provider derives the stage URL from {database_name}/{schema_name} in manifest.yml.
No URL configuration needed.
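As a rough illustration of that derivation, the stage path depends only on the slot name and the two manifest values. The function name `stage_url` and the example values are assumptions; the bucket scheme and bucket name are not specified in this ADR, so the sketch returns only the relative prefix.

```python
EXTERNAL_SLOTS = ("external_input_files", "external_output_files")


def stage_url(slot: str, database_name: str, schema_name: str) -> str:
    """Build the schema-scoped stage prefix for an external slot.

    The provider/flux levels below this prefix are intentionally left out:
    one stage per schema covers all of them.
    """
    if slot not in EXTERNAL_SLOTS:
        raise ValueError(f"{slot!r} has no bucket convention")
    return f"{slot}/{database_name}/{schema_name}/"


# Hypothetical manifest values, for illustration only.
print(stage_url("external_input_files", "sales_db", "raw"))
```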
Auto-provisioned registry tables
Each slot except resources_static provisions a registry table alongside the stage:
| Slot | Registry | Extra columns | Purpose |
|---|---|---|---|
| external_input_files | ✓ | none | Tracks incoming files; stream feeds POLL_AND_DISPATCH / CONSUME_STREAM |
| external_output_files | ✓ | hash_sorted, hash_raw | Non-regression: compare output hashes against gold hashes per job version |
| internal_exposed_files | ✓ | hash_sorted, hash_raw | Same non-regression capability for exposed internal files |
| resources_static | none | none | Static resources: no tracking needed |
Hash columns:
- hash_sorted — hash of file content with rows sorted → detects content changes regardless of row order
- hash_raw — hash of file as-is → detects exact byte-level changes including row order
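A minimal sketch of the two hashes, under stated assumptions: the ADR does not name the hash algorithm or the file format, so this uses SHA-256 and treats the file as newline-delimited rows (a header row, if any, is sorted along with the data).

```python
import hashlib


def hash_raw(content: bytes) -> str:
    """Hash the file exactly as stored: sensitive to row order and every byte."""
    return hashlib.sha256(content).hexdigest()


def hash_sorted(content: bytes) -> str:
    """Hash the rows after sorting them: insensitive to row order."""
    rows = sorted(content.splitlines())
    return hashlib.sha256(b"\n".join(rows)).hexdigest()


a = b"id=2\nid=1\n"
b = b"id=1\nid=2\n"
print(hash_sorted(a) == hash_sorted(b))  # same rows, different order
print(hash_raw(a) == hash_raw(b))        # byte-level comparison
```

Comparing both hashes distinguishes a pure reordering (hash_sorted matches, hash_raw differs) from a real content change (both differ).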
Non-regression pattern: in SANDBOX, replay the input file of a prod run that is known good, then compare the SANDBOX output hash against the hash stored in the prod registry for that job version. A match means no regression. This needs no separate test harness and no synthetic test data: it uses real prod inputs and real hashes.
Full CI vision (not yet implemented):
A non-prod-only table per schema stores the gold dataset:
sp_version | input_file_name | outputs: list[{output_file_name, hash_raw, hash_sorted}]
CI pipeline:
1. Fetch latest gold version from the table
2. Upload reference input file to external_input_files → triggers the DAG automatically
3. Compare output hashes against gold
4. Pass / fail
The infrastructure (stage + stream + registry) is already in place — the CI layer is just the orchestration on top.
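Step 3 of the CI pipeline could look roughly like the sketch below. The record shape follows the gold table columns above; the function name `compare_to_gold` and the choice to report row-order-only differences separately are assumptions, since the ADR does not say whether a reordering should fail the build.

```python
def compare_to_gold(gold_outputs, actual_outputs):
    """Compare actual output hashes against gold; an empty result means pass.

    Both arguments are lists of dicts with the keys
    output_file_name, hash_raw, hash_sorted.
    """
    actual = {o["output_file_name"]: o for o in actual_outputs}
    failures = []
    for gold in gold_outputs:
        name = gold["output_file_name"]
        got = actual.get(name)
        if got is None:
            failures.append(f"{name}: missing output")
        elif got["hash_raw"] == gold["hash_raw"]:
            continue  # byte-identical to gold
        elif got["hash_sorted"] == gold["hash_sorted"]:
            failures.append(f"{name}: row order changed")
        else:
            failures.append(f"{name}: content changed")
    return failures
```

A CI job would then pass when the returned list is empty and print the list otherwise.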
Why {provider_ref}/{ref_flux_provider}/
MFT tools generate opaque flux names (wkd_etl_01, etl_adp_06). Without a convention, files land
as file_01, file_02 with no semantic meaning. Embedding {provider_ref}/{ref_flux_provider}/ in the
path turns the S3 structure into the mapping (S3 path → MFT interface → file per flux), so there
is no separate mapping table to maintain.
Example: external_output_files/data_product__pluto_match/client_a/hubspot/export_leads/ is self-documenting.
pinkysight reads the path hierarchy and can serve dimensional views out of the box:
all incoming files by provider, all outgoing files by consumer, volume by flux — no additional
metadata, no mapping table. The path structure IS the data model for file flow monitoring.
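To illustrate how dimensions fall out of the path alone, the sketch below splits an object key into the levels of the bucket convention and counts files per flux. It is not pinkysight's implementation; `FilePath`, `parse_key`, `volume_by_flux`, and the example keys are hypothetical.

```python
from collections import Counter
from typing import NamedTuple


class FilePath(NamedTuple):
    slot: str
    database_name: str
    schema_name: str
    party_ref: str   # provider_ref for input, consumer_ref for output
    flux_ref: str
    file_name: str


def parse_key(key: str) -> FilePath:
    """Split an object key into the levels defined by the bucket convention."""
    parts = key.strip("/").split("/")
    if len(parts) < 6:
        raise ValueError(f"key does not follow the convention: {key!r}")
    slot, db, schema, party, flux = parts[:5]
    return FilePath(slot, db, schema, party, flux, "/".join(parts[5:]))


def volume_by_flux(keys):
    """Count files per (party, flux) pair using only the path."""
    return Counter((p.party_ref, p.flux_ref) for p in map(parse_key, keys))


keys = [
    "external_input_files/db/sch/acme/wkd_etl_01/file_01.csv",
    "external_input_files/db/sch/acme/wkd_etl_01/file_02.csv",
    "external_input_files/db/sch/globex/etl_adp_06/file_01.csv",
]
print(volume_by_flux(keys))
```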
POLL_AND_DISPATCH polls external_input_files/{database_name}/{schema_name}/ and builds two lists passed to the
business SP:
- new_files — files not yet processed
- existing_files — files already processed
The business SP receives these lists and is path-agnostic — it never needs to know about
{provider_ref}/{ref_flux_provider}/. The path complexity is fully absorbed by the infrastructure layer.
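The split into the two lists can be pictured as below. This is a sketch under assumptions: it takes the set of already-processed paths as a plain argument, whereas the real task relies on the registry and its stream, and `split_files` and `business_sp` are hypothetical names.

```python
def split_files(listed, processed):
    """Partition the files found under the stage into new and already-processed."""
    processed = set(processed)
    new_files = [f for f in listed if f not in processed]
    existing_files = [f for f in listed if f in processed]
    return new_files, existing_files


def business_sp(new_files, existing_files):
    """Stand-in for the business SP: it only sees two lists, never the path convention."""
    return {"to_process": len(new_files), "already_done": len(existing_files)}


listed = ["acme/wkd_etl_01/file_01.csv", "acme/wkd_etl_01/file_02.csv"]
new_files, existing_files = split_files(listed, processed=["acme/wkd_etl_01/file_01.csv"])
print(business_sp(new_files, existing_files))
```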
Consequences
- Zero configuration per stage: name = contract.
- POLL_AND_DISPATCH injection is automatic when external_input_files is declared; no separate task declaration needed.
- On high-volume file ingestion pipelines, the convention eliminates all naming debates: where to land files, where to put debug outputs, where processed copies go. The structure answers these questions before they're asked.
- The predictable path structure enables pinkysight to monitor file flows by path automatically — no additional configuration needed because the convention is known at the suite level.
- Users who need a stage outside these four slots use the generic stage resource type.
- The opinionated slots encode a proven pipeline topology: cloud storage ingestion → extraction → internal data. Teams with a different topology use generic stages: the slots don't block them, they just don't help them.