Questions to ask (New Onboarding)

Find POCs (points of contact) first: Source, Business (or project owner), Analysts (BI or DS teams)

For BI

  • Background (or use case)
  • Business Impact
  • Any targeted timeline for completion (for product releases)
  • Well defined (vs) Exploratory
  • Frequency
  • Granularity (Raw/Aggregated)
  • Realtime (vs) Offline (Batch?)
  • Target refresh frequency
  • SLA (or tolerance for delays)
  • Backfill requirement (How much recent data needed?)
  • Retention
  • DQ Checks for validation rules
  • Reporting requirements
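Once the BI team's validation rules are collected, they can be codified as simple DQ checks. A minimal sketch, assuming hypothetical rules (required columns must be non-null, a key column must be unique); function and parameter names are illustrative, not from any specific library:

```python
def validate_rows(rows, required_cols, unique_key):
    """Return a list of DQ violations for a batch of row dicts.

    Checks two example rules gathered during onboarding:
    - required columns must not be null/empty
    - the unique_key column must not contain duplicates
    """
    errors = []
    seen = set()
    for i, row in enumerate(rows):
        for col in required_cols:
            if row.get(col) in (None, ""):
                errors.append(f"row {i}: null/empty '{col}'")
        key = row.get(unique_key)
        if key in seen:
            errors.append(f"row {i}: duplicate {unique_key}={key}")
        seen.add(key)
    return errors
```

In practice these rules would live in a DQ framework rather than ad hoc code, but capturing them in this shape during onboarding makes the BI team's expectations explicit and testable.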

For Source

  • Type of source and data
  • Frequency of data generation/publish (Velocity)
  • Volume
  • Mutability of data (immutable; contains PKs that receive updates; no PKs but still receives updates; etc.)
  • How can the reconciliation between source and target be done?
  • Optional: Source to target column mapping (Kafka for example)
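One common answer to the reconciliation question is comparing per-partition row counts (or checksums) between source and target. A minimal sketch under that assumption; the dict-based inputs are a stand-in for whatever the source and warehouse actually report:

```python
def reconcile(source_counts, target_counts):
    """Compare per-partition row counts from source and target.

    Returns a dict of partition -> (source_count, target_count)
    for every partition where the two sides disagree, including
    partitions present on only one side (missing side counts as 0).
    """
    mismatches = {}
    for part in set(source_counts) | set(target_counts):
        s = source_counts.get(part, 0)
        t = target_counts.get(part, 0)
        if s != t:
            mismatches[part] = (s, t)
    return mismatches
```

Row counts catch dropped or duplicated loads cheaply; for mutable sources (updates without clean PKs) a content checksum per partition is usually needed on top of counts.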

For DE Team

  • Infra & Cost
  • Bandwidth/resource allocation
  • Access Control
  • Compliance & regulatory

DE-specific things to check/evaluate for new jobs

  • Check if a new job is necessary
    • Can the requirement be handled by an existing job?
  • Validate DDLs
    • Are data types and constraints appropriate?
  • Ensure peer code reviews
    • Has the code been reviewed, with clear comments provided?
  • Handle job failures
    • Are failures and retries managed properly?
    • Check for idempotency: running a job multiple times with the same input must produce the same result without unintended side effects, so a retried job (after a failure or otherwise) won't duplicate or corrupt the output.
    • Ensure a contingency plan for pipeline failures
  • Check backfill capability
    • Can the pipeline process historical data?
  • Verify job dependencies, if multiple jobs are involved.
    • Are dependencies set correctly?
    • If dependencies can't be set, are CRON schedules set appropriately? (And ensure overlapping runs, due to schedule, won't cause issues)
  • Document thoroughly
    • Are the pipeline details and the data model documented?
  • Optimize job performance and check Scalability readiness
    • Is the job optimized for execution time and resource usage? (e.g., parallelism, partitioning, and indexing)
    • Can the pipeline handle increasing data volumes and changes in workload efficiently?
  • Data Quality
    • Are data quality checks in place (e.g., null values, data type mismatches, duplicates)?
    • Is there a mechanism to detect and address data schema or quality changes over time?
  • Monitoring and Alerting
    • Are alerts configured for critical issues like job failures, data anomalies, or missed SLAs?
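The idempotency point above can be sketched as the overwrite-partition pattern: the job derives its output purely from the input partition and fully replaces the target partition, so a retry rewrites the same rows instead of appending duplicates. A minimal sketch with a plain dict standing in for the target table; all names are illustrative:

```python
def run_job(store, partition, compute):
    """Idempotent load for one partition.

    `compute(partition)` must be deterministic for a given input
    partition. The target partition is fully overwritten (never
    appended to), so re-running after a failure produces the same
    final state as a single successful run.
    """
    rows = compute(partition)
    store[partition] = rows  # overwrite, not store.setdefault/append
    return len(rows)
```

An append-based load (`store[partition] += rows`) would double the data on every retry; overwrite-by-partition is the usual way to make batch backfills and retries safe.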