Questions to ask (New Onboarding)
Find POCs first: Source, Business (or project owner), Analysts (BI or DS teams)
For BI
- Background (or use case)
- Business Impact
- Any targeted timeline for completion (for product releases)
- Well defined (vs) Exploratory
- Frequency
- Granularity (Raw/Aggregated)
- Realtime (vs) Offline (Batch?)
- Target refresh frequency
- SLA (or tolerance for delays)
- Backfill requirement (How much recent data needed?)
- Retention
- DQ Checks for validation rules
- Reporting requirements
For Source
- Type of source and data
- Frequency of data generation/publish (Velocity)
- Volume
- Mutability of data (data never changes; data has PKs that get updated; data has no PK but gets updates; etc.)
- How can the reconciliation between source and target be done?
- Optional: Source to target column mapping (Kafka for example)
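The reconciliation question above can be sketched in code. A minimal sketch, assuming rows are represented as dicts and both sides fit in memory; the function name `reconcile` and the fingerprint scheme are illustrative, not a standard API. Real pipelines would run equivalent count/checksum queries against the actual source and target stores.

```python
import hashlib
from collections import Counter

def row_fingerprint(row: dict) -> str:
    """Order-independent fingerprint of a row's key/value pairs."""
    canonical = "|".join(f"{k}={row[k]}" for k in sorted(row))
    return hashlib.sha256(canonical.encode()).hexdigest()

def reconcile(source_rows: list, target_rows: list) -> dict:
    """Compare row counts and report fingerprints present on one side only."""
    src = Counter(row_fingerprint(r) for r in source_rows)
    tgt = Counter(row_fingerprint(r) for r in target_rows)
    return {
        "source_count": sum(src.values()),
        "target_count": sum(tgt.values()),
        "missing_in_target": list((src - tgt).elements()),
        "extra_in_target": list((tgt - src).elements()),
    }

# Example: one row landed, one is missing in the target
source = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
target = [{"id": 1, "amt": 10}]
report = reconcile(source, target)
```

Counts catch volume drift; the per-row fingerprints also catch silent value corruption that matching counts would hide.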
For DE Team
- Infra & Cost
- Bandwidth/resource allocation
- Access Control
- Compliance & regulatory
DE-specific things to check/evaluate for new jobs
- Check if a new job is necessary
- Can the requirement be handled by an existing job?
- Validate DDLs
- Are data types and constraints appropriate?
- Ensure peer code reviews
- Has the code been reviewed, with clear comments provided?
- Handle job failures
- Are failures and retries managed properly?
- Check for idempotency (running a job multiple times with the same input should produce the same result without unintended side effects, so a retried job, after a failure or otherwise, won't duplicate or corrupt the output)
- Ensure a contingency plan for pipeline failures
- Check backfill capability
- Can the pipeline process historical data?
- Verify job dependencies when multiple jobs are involved
- Are dependencies set correctly?
    - If dependencies can't be set, are CRON schedules set appropriately? (And ensure that schedule-induced overlapping runs won't cause issues)
- Document thoroughly
    - Are the pipeline details and data model documented?
- Optimize job performance and check scalability readiness
- Is the job optimized for execution time and resource usage? (e.g., parallelism, partitioning, and indexing)
- Can the pipeline handle increasing data volumes and changes in workload efficiently?
- Data Quality
- Are data quality checks in place (e.g., null values, data type mismatches, duplicates)?
- Is there a mechanism to detect and address data schema or quality changes over time?
- Monitoring and Alerting
- Are alerts configured for critical issues like job failures, data anomalies, or missed SLAs?
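The idempotency check above can be illustrated with the common delete-then-insert ("partition overwrite") pattern: re-running a load for the same partition replaces that partition's rows instead of appending duplicates. A minimal sketch; the in-memory `warehouse` dict and `load_partition` name stand in for a real table and loader.

```python
# partition_key -> rows; a stand-in for a partitioned warehouse table
warehouse = {}

def load_partition(partition_key: str, rows: list) -> None:
    # Overwrite the whole partition: any rows previously written for this
    # key are dropped before the new batch lands, so reruns are safe.
    warehouse[partition_key] = list(rows)

batch = [{"id": 1, "amt": 10}, {"id": 2, "amt": 20}]
load_partition("2024-01-01", batch)
load_partition("2024-01-01", batch)  # retry with the same input: no duplicates
```

An append-only loader would leave four rows after the retry; the overwrite pattern leaves two, which is what makes blind retries safe.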

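The data quality checks listed above (nulls, type mismatches, duplicate keys) can be sketched as a single validation pass. Assumes rows as dicts; the `dq_check` signature is illustrative, and frameworks like Great Expectations cover the same ground declaratively.

```python
def dq_check(rows, required, types, key):
    """Return a list of human-readable issues found in `rows`.

    required: columns that must not be null
    types:    column -> expected Python type
    key:      column that must be unique across rows
    """
    issues = []
    seen = set()
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                issues.append(f"row {i}: null in {col}")
        for col, t in types.items():
            if row.get(col) is not None and not isinstance(row[col], t):
                issues.append(f"row {i}: {col} is not {t.__name__}")
        k = row.get(key)
        if k in seen:
            issues.append(f"row {i}: duplicate key {k}")
        seen.add(k)
    return issues

rows = [
    {"id": 1, "email": "a@x.com"},
    {"id": 1, "email": None},   # duplicate key and null email
    {"id": 2, "email": 42},     # type mismatch
]
issues = dq_check(rows, required=["email"], types={"email": str}, key="id")
```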