Questions to ask (New Onboarding)
Find POCs first: Source, Business (or project owner), Analysts (BI or DS teams)
For BI
- Background (or use case)
- Business Impact
- Any targeted timeline for completion (for product releases)
- Well defined (vs) Exploratory
- Frequency
- Granularity (Raw/Aggregated)
- Realtime (vs) Offline (Batch?)
- Target refresh frequency
- SLA (or tolerance for delays)
- Backfill requirement (How much recent data needed?)
- Retention
- DQ Checks for validation rules
- Reporting requirements
For Source
- Type of source and data
- Frequency of data generation/publish (Velocity)
- Volume
- Mutability of data (data never changes; data has PKs that get updated; data has no PK but gets updates; etc.)
- How can the reconciliation between source and target be done?
- Optional: Source to target column mapping (Kafka for example)
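The reconciliation question above can be sketched in code. A minimal sketch, assuming rows are represented as dicts and both sides fit in memory; the function name `reconcile` and the fingerprint scheme are illustrative, not a standard API. Real pipelines would run equivalent count/checksum queries against the actual source and target stores.

```python
import hashlib
from collections import Counter

def row_fingerprint(row: dict) -> str:
    """Order-independent fingerprint of a row's key/value pairs."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode()).hexdigest()

def reconcile(source_rows: list, target_rows: list) -> dict:
    """Compare row counts and report fingerprints present on one side only."""
    src = Counter(row_fingerprint(r) for r in source_rows)
    tgt = Counter(row_fingerprint(r) for r in target_rows)
    return {
        "source_count": sum(src.values()),
        "target_count": sum(tgt.values()),
        "missing_in_target": list((src - tgt).elements()),
        "extra_in_target": list((tgt - src).elements()),
    }

# Example: one row landed, one is missing in the target
source = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
target = [{"id": 1, "amt": 10}]
report = reconcile(source, target)
```

Counts catch volume drift; the per-row fingerprints also catch silent value corruption that matching counts would hide.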
For DE Team
- Infra & Cost
- Bandwidth/resource allocation
- Access Control
- Compliance & regulatory
DE-specific things to check/evaluate for new jobs
- Check if a new job is necessary
- Can the requirement be handled by an existing job?
- Validate DDLs
- Are data types and constraints appropriate?
- Ensure peer code reviews
- Has the code been reviewed, with clear comments provided?
- Handle job failures
- Are failures and retries managed properly?
- Check for idempotency (running a job multiple times with the same input should produce the same result without unintended side effects, so a retried job, after a failure or otherwise, won't duplicate or corrupt the output)
- Ensure a contingency plan for pipeline failures
- Check backfill capability
- Can the pipeline process historical data?
- Verify job dependencies when multiple jobs are involved
- Are dependencies set correctly?
    - If dependencies can't be set, are CRON schedules set appropriately? (And ensure that schedule-induced overlapping runs won't cause issues)
- Document thoroughly
    - Are the pipeline details and data model documented?
- Optimize job performance and check scalability readiness
- Is the job optimized for execution time and resource usage? (e.g., parallelism, partitioning, and indexing)
- Can the pipeline handle increasing data volumes and changes in workload efficiently?
- Data Quality
- Are data quality checks in place (e.g., null values, data type mismatches, duplicates)?
- Is there a mechanism to detect and address data schema or quality changes over time?
- Monitoring and Alerting
- Are alerts configured for critical issues like job failures, data anomalies, or missed SLAs?
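The idempotency check above can be illustrated with the common delete-then-insert ("partition overwrite") pattern: re-running a load for the same partition replaces that partition's rows instead of appending duplicates. A minimal sketch; the in-memory `warehouse` dict and `load_partition` name stand in for a real table and loader.

```python
# partition_key -> rows; a stand-in for a partitioned warehouse table
warehouse = {}

def load_partition(partition_key: str, rows: list) -> None:
    # Overwrite the whole partition: any rows previously written for this
    # key are dropped before the new batch lands, so reruns are safe.
    warehouse[partition_key] = list(rows)

batch = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
load_partition("2024-01-01", batch)
load_partition("2024-01-01", batch)  # retry with the same input: no duplicates
```

An append-only loader would leave four rows after the retry; the overwrite pattern leaves two, which is what makes blind retries safe.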

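The data quality checks listed above (nulls, type mismatches, duplicate keys) can be sketched as a single validation pass. Assumes rows as dicts; the `dq_check` signature is illustrative, and frameworks like Great Expectations cover the same ground declaratively.

```python
def dq_check(rows, required, types, key):
    """Return a list of human-readable issues found in `rows`.

    required: columns that must not be null
    types:    column -> expected Python type
    key:      column that must be unique across rows
    """
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                issues.append(f"row {i}: null in {col}")
        for col, t in types.items():
            if row.get(col) is not None and not isinstance(row[col], t):
                issues.append(f"row {i}: {col} is not {t.__name__}")
        k = row.get(key)
        if k in seen:
            issues.append(f"row {i}: duplicate key {k}")
        seen.add(k)
    return issues

rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 1, "email": None},   # duplicate key and null email
    {"id": 2, "email": 42},     # type mismatch
]
issues = dq_check(rows, required=["email"], types={"email": str}, key="id")
```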