Data pipeline

Depending on role and mindset, there are different ways to think about the data pipeline in a project or business. Some viewpoints, possibly exaggerated or preconceived:

  • Researcher: Concerned with the validity and accuracy of the data, as well as the methods used to collect and clean it.

    • Ensure that the data is collected in a way that is representative of the population of interest.

    • Clean the data to remove any errors or inconsistencies.

    • Use appropriate statistical methods to analyze the data.

    • Interpret the results of the analysis and draw conclusions.

  • Engineer: Responsible for the design and implementation of the data pipeline. May also be involved in the maintenance and troubleshooting of the system.

    • Design a data pipeline that is efficient and scalable.

    • Implement the data pipeline using secure and reliable technologies.

    • Monitor the data pipeline for performance and errors.

    • Troubleshoot and fix problems with the data pipeline.
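
As a rough illustration of the researcher's cleaning step and the engineer's monitoring step above, the following Python sketch chains two steps and logs how long each one takes. Everything in it is an assumption made for illustration: the function names (`clean`, `analyze`, `run_pipeline`), the cleaning rule, and the made-up readings are hypothetical and not taken from the course material.

```python
import logging
import statistics
import time
from typing import Any, Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("pipeline")

Step = Callable[[Any], Any]


def clean(readings: list) -> list[float]:
    """Drop missing and non-numeric entries (hypothetical cleaning rule)."""
    cleaned = []
    for value in readings:
        try:
            cleaned.append(float(value))
        except (TypeError, ValueError):
            log.warning("Dropping invalid reading: %r", value)
    return cleaned


def analyze(values: list[float]) -> dict[str, float]:
    """Summarize the cleaned values with simple descriptive statistics."""
    return {"n": len(values), "mean": statistics.mean(values)}


def run_pipeline(data: Any, steps: list[Step]) -> Any:
    """Run each step in order, log its duration, and re-raise any failure."""
    for step in steps:
        start = time.perf_counter()
        try:
            data = step(data)
        except Exception:
            log.exception("Step %s failed", step.__name__)
            raise
        elapsed = time.perf_counter() - start
        log.info("Step %s took %.4f s", step.__name__, elapsed)
    return data


raw = [21.5, "22.1", None, "n/a", 19.9]  # made-up sensor readings
result = run_pipeline(raw, [clean, analyze])
print(result)
```

In a real pipeline, the logged durations and warnings would more likely feed a monitoring system than the console, and the cleaning rule would depend on what counts as an error in the data at hand.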

In a business you might also have:

  • Middle manager: Responsible for overseeing the data pipeline and ensuring that it meets the needs of the organization. May also be involved in the budgeting and staffing of the data team.

    • Set goals and priorities for the data pipeline.

    • Allocate resources to the data team.

    • Track the progress of the data pipeline.

    • Communicate the status of the data pipeline to stakeholders.

  • Consultant: Brought in to provide expertise on specific aspects of the data pipeline, such as data collection, cleaning, or analysis.

    • Provide advice on the best practices for collecting, cleaning, and analyzing data.

    • Help to troubleshoot problems with the data pipeline.

    • Train the data team on new technologies or methods.

  • CEO: Ultimately responsible for the data pipeline and its impact on the organization. May not be directly involved in the day-to-day operations of the data team, but should have a high-level understanding of the data pipeline and its importance.

    • Set the overall vision for the data pipeline.

    • Ensure that the data pipeline is aligned with the strategic goals of the organization.

    • Make decisions about the budget and staffing of the data team.

    • Communicate the importance of data to the rest of the organization.

Summary

https://github.com/khliland/IND320/blob/main/D2Dbook/images/Data_pipeline.png?raw=TRUE

  • The figure above gives a visual summary of the pipeline.

  • The pipeline aligns well with Data-Driven Decision Modelling.

  • The end result may be a dashboard or report for monitoring, exploration, documentation, etc.

Exercise/discussion

  • Which obstacles can you envision for the different steps in the pipeline?

  • … and for the pipeline as a whole?