Dynamic DAG loading with pinned OSDU package versions
Context and Problem Statement
Airflow DAGs for OSDU ingestion workflows depend on Python packages (osdu-airflow, osdu-ingestion, osdu-api) that must be version-compatible with the DAGs themselves. The DAGs are maintained in an upstream repository (ingestion-dags), not in this infrastructure repo. We need a strategy that keeps DAGs current with their golden source while ensuring version compatibility between DAGs and their runtime dependencies.
Decision Drivers
- DAGs copied into the infra repo become stale — they never receive upstream fixes or new workflows
- Tight version coupling exists between DAGs and osdu-airflow/osdu-ingestion packages (same release train)
- The stock apache/airflow image does not include OSDU-specific packages
- ROSA implementation uses custom-built images with packages baked in, which is not portable to AKS without a container registry and build pipeline
- The osdu-developer reference implementation installs packages via pip at startup and downloads DAGs at deploy time
Considered Options
- Static DAG copies in the infra repo with no OSDU packages (current state — 9 of 11 DAGs broken)
- Custom Airflow container image with packages baked in (ROSA approach)
- Dynamic DAG download + pip install at startup (osdu-developer approach)
Decision Outcome
Chosen option: "Dynamic DAG download + pip install at startup", because it keeps DAGs sourced from the upstream repository, avoids custom image builds, and lets a small set of pinned version variables control the entire dependency chain.
Implementation
Three Terraform variables control the OSDU Airflow package versions:
- osdu_airflow_version (default: 0.29.2): osdu-airflow package and ingestion DAGs source tag
- osdu_ingestion_version (default: 0.29.0): osdu-ingestion package
- osdu_api_version (default: 1.1.0): osdu-api package
These are separate because the packages follow independent release cadences.
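As a rough illustration of how the three variables feed the rest of the setup, the sketch below derives the DAG source tag and the exact-version pip specifiers from them. The dictionary and function names are hypothetical and only mirror the Terraform defaults listed above; the real values live in Terraform, not in Python.

```python
# Hypothetical mirror of the three Terraform variables and their defaults.
VERSIONS = {
    "osdu-airflow": "0.29.2",
    "osdu-ingestion": "0.29.0",
    "osdu-api": "1.1.0",
}


def dag_source_tag(airflow_version: str) -> str:
    """Return the upstream tag for the DAG archive, e.g. 'v0.29.2'."""
    return f"v{airflow_version}"


def pinned_requirements(versions: dict[str, str]) -> list[str]:
    """Return one exact-version pip specifier per package."""
    return [f"{name}=={version}" for name, version in versions.items()]


if __name__ == "__main__":
    print(dag_source_tag(VERSIONS["osdu-airflow"]))
    print(" ".join(pinned_requirements(VERSIONS)))
```

Using `==` rather than a range specifier is what keeps each package at exactly the version named by its variable.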
- DAG source: A null_resource downloads the ingestion-dags archive from the upstream GitLab repository at the tag matching osdu_airflow_version (e.g., v0.29.2) and creates an ingestion-dags ConfigMap.
- Python packages: The _PIP_ADDITIONAL_REQUIREMENTS environment variable installs the three packages at their pinned versions from OSDU's private PyPI registries (using --extra-index-url pointing to OSDU GitLab package registries).
- Service URLs: AIRFLOW_VAR_* environment variables point DAGs to in-cluster OSDU service endpoints.
- DAG loading: Two ConfigMaps are merged via init containers:
  - airflow-dags: local generic DAGs (echo, monitoring) with no external dependencies
  - ingestion-dags: downloaded from upstream, marked optional: true so Airflow starts even if the download fails
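The merge step with an optional source can be sketched as follows. This is a Python stand-in for what the init containers do, not the actual container command (which this record does not show); the function name and directory layout are assumptions made for illustration.

```python
import shutil
import tempfile
from pathlib import Path


def merge_dag_sources(required: Path, optional: Path, target: Path) -> list[str]:
    """Copy DAG files from both sources into target; skip the optional one if absent."""
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for source, must_exist in ((required, True), (optional, False)):
        if not source.is_dir():
            if must_exist:
                raise FileNotFoundError(f"required DAG source missing: {source}")
            continue  # optional source absent: carry on with generic DAGs only
        for dag_file in sorted(source.glob("*.py")):
            shutil.copy2(dag_file, target / dag_file.name)
            copied.append(dag_file.name)
    return copied


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "airflow-dags").mkdir()
        (root / "airflow-dags" / "echo.py").write_text("# generic DAG\n")
        # ingestion-dags is deliberately not created, simulating a failed download.
        print(merge_dag_sources(root / "airflow-dags", root / "ingestion-dags", root / "dags"))
```

Treating only the upstream source as optional reflects the intent above: a failed download degrades the deployment to generic DAGs instead of blocking Airflow startup.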
To upgrade OSDU versions: update the version variables and re-apply. The DAG source tag follows osdu_airflow_version.
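The service URL item above relies on Airflow's convention of resolving a variable named `key` from an environment variable `AIRFLOW_VAR_KEY`. The sketch below imitates that lookup without importing Airflow; the variable key and URL are hypothetical placeholders, not the names the upstream DAGs actually read.

```python
import os
from typing import Mapping, Optional

VAR_ENV_PREFIX = "AIRFLOW_VAR_"


def airflow_var_env(service_urls: Mapping[str, str]) -> dict[str, str]:
    """Turn {variable key: URL} into AIRFLOW_VAR_* environment entries."""
    return {f"{VAR_ENV_PREFIX}{key.upper()}": url for key, url in service_urls.items()}


def lookup_variable(key: str, environ: Mapping[str, str] = os.environ) -> Optional[str]:
    """Imitate how a DAG-side variable lookup resolves from the environment."""
    return environ.get(f"{VAR_ENV_PREFIX}{key.upper()}")


if __name__ == "__main__":
    # Hypothetical key and in-cluster URL, for illustration only.
    env = airflow_var_env({"storage_url": "http://storage.example.svc.cluster.local"})
    print(env)
    print(lookup_variable("storage_url", env))
```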
Consequences
- Good, because DAGs come from the golden source repository and receive upstream updates on version bump
- Good, because three version variables make the DAG/package coupling explicit and independently updatable
- Good, because no custom container image or registry is required; the deployment uses stock apache/airflow
- Good, because packages are pinned to exact versions, preventing unexpected upgrades
- Good, because the ingestion-dags ConfigMap is optional, so Airflow starts with generic DAGs even if the download fails
- Bad, because pip install at container startup adds 30-60 seconds to pod start time
- Bad, because the OSDU PyPI registries must be reachable from the deployment machine at terraform apply time and from the cluster at pod startup
- Bad, because _PIP_ADDITIONAL_REQUIREMENTS is a startup-time mechanism: if a package version is unavailable, pods will crash-loop rather than fail at deploy time