Project Description
In this project you will implement, evaluate, operate, monitor, and optimize an end-to-end ML pipeline. We provide the details of four kinds of pipelines, including their datasets, model choices, and metrics. Which of these pipelines you implement is your choice.
The focus of the project is to implement the pipeline choices, evaluate which configuration of the pipeline is best and under what data distributions, operationalize the pipelines on the cloud, monitor their real-time behavior, and optimize the choice of a pipeline configuration.
The project entails many concerns, including deployment, scaling, reliability, drift, and feedback loops. It is structured into three milestones plus a final project presentation.
Teamwork: You will work on this project in your assigned teams. As a team, you will use a shared GitHub repository and a virtual machine to coordinate your work. Please establish a way of communicating and collaborating that works for your team -- for example, a Teams channel. Please agree on tasks and responsibilities. We expect that team members will have different backgrounds and different amounts of experience with machine learning and engineering -- this is intentional. Use the first milestone to identify strengths and weaknesses, fill gaps in background knowledge, and teach each other about tools and practices that are effective for this task. Finally, be a good team citizen.
Milestones: For each milestone and the final presentation, there are separate deliverables, described below. The milestones are checkpoints to ensure that certain functionality has been delivered.
Milestones are graded on a pass/fail scheme against the criteria listed with each milestone. A fail turns into a pass if the criterion is achieved by the end of the next milestone. All milestones must be delivered by the final due date. Note that milestones build on each other. We recommend looking at all milestones before starting the project, as doing so may lead you to different design decisions and help you avoid extensive rework.
Pipelines: The schematic for the pipelines is here. Each pipeline has input datasets and a variety of models for each stage. The SLA for each pipeline is provided at the end of the schematic.
Infrastructure: We provide Dockerfiles for each pipeline use case. The Dockerfile specifies the datasets and models to download. You should build the Dockerfile into a container image and inspect the data and model files. Further details on how to use the Dockerfile can be found in the projects directory.
You may request more computing resources from the course staff if the virtual machine's resources are insufficient -- we may or may not be able to accommodate such requests; fulfilling one may take a few days and will require a system reboot.
Data: You do not need to use all of the provided data. Identify which data is relevant.
Languages, tools, and frameworks: Your team is free to choose any technology stack for any part of this project. You have root access to your virtual machine and are free to install any software you deem suitable. You may also use external data and services (e.g., cloud services) as long as you can make them accessible to the course staff as well. For example, you can use the free cloud credits that companies like Microsoft, Google, and AWS provide to students. Whenever you set up tools or services, pay attention to configuring them with reasonable security measures; student solutions have been actively exploited in past iterations of this project, which can lead to data loss or loss of internet access for your virtual machine.
Documentation: For all milestones, we ask you to discuss some aspects of your design decisions and implementation. It may be a good idea to write general documentation that is useful for the team in a shared place that is accessible to everyone (e.g., a README.md or wiki pages on GitHub). Conversely, it may be a good idea to include text or figures you write for reports as part of the project documentation. Feel free to link to more detailed documentation from your report, or simply copy material from existing documentation into the report.
In general, we do not care much about the format or location of the parts of your answers, such as code and screenshots, as long as we can find them. Please make an effort to be clear about where to find content that is not copied directly into the report, preferably with direct links to individual files or even specific lines.
Milestone 1: The primary task is to set up multiple pipeline configurations, where each configuration produces some result that allows pipelines to be compared. Note that we do not care about the accuracy or fidelity of the result at this point; we only care whether a pipeline is functional. For example, in ID card processing, one pipeline configuration is [red channel, Tesseract, Classification-level-2]. Define the accuracy of this pipeline and compute it.
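To make this concrete, here is a minimal sketch of one pipeline configuration expressed as a chain of step functions, with a simple accuracy computation. The step implementations are placeholders, not the actual models we provide -- substitute the models from your use case (e.g., red-channel extraction, Tesseract OCR, and a level-2 classifier for ID card processing).

```python
# Minimal sketch: one pipeline configuration as an ordered list of steps.

def red_channel(image):
    """Preprocessing step (placeholder): keep only the red channel."""
    return image[..., 0]

def tesseract_ocr(image):
    """OCR step (placeholder; e.g., wrap pytesseract.image_to_string)."""
    return "extracted text"

def classify_level2(text):
    """Classification step (placeholder)."""
    return "label"

PIPELINE = [red_channel, tesseract_ocr, classify_level2]

def run_pipeline(pipeline, example):
    """Feed one input through every step in order."""
    result = example
    for step in pipeline:
        result = step(result)
    return result

def accuracy(pipeline, examples, labels):
    """Fraction of examples whose final output matches the expected label."""
    correct = sum(run_pipeline(pipeline, x) == y
                  for x, y in zip(examples, labels))
    return correct / len(examples)
```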
T1. Meet with your team members. Decide how you divide the work (minimum: who is going to do what and by when). Share team skills and responsibilities.
T2. Inform the course staff which pipeline use case your team is using.
T3. Set up pipeline configurations in a Jupyter notebook. This is a complex task, which will require you to connect the input datasets to the first step of the pipeline, the first step to the second step, and so on, and finally output a basic accuracy number for the pipeline. We have provided one accuracy metric, which can be used as an assertion. It is up to you to think of other, improved accuracy metrics and to justify their use. Once you have one single pipeline working, experiment by changing models at each step and set up as many pipeline configurations as possible.
Run each pipeline, measure its latency, and compare it with the provided SLA. Maintain a fixed batch size of data across pipelines (see the sketch after this task list).
T4. Practice teamwork and reflect on the process.
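As a starting point for T3, the following sketch enumerates pipeline configurations by taking the cross product of the model choices per step, and records accuracy, latency, and SLA compliance for a fixed batch. It builds on the sketch above; `examples` and `labels` are assumed to have been loaded from the provided datasets, and the SLA value is a placeholder for the one in the schematic.

```python
import itertools
import time

# Hypothetical model choices per step; replace these with the models your
# Dockerfile provides. Each entry is a callable from the sketch above.
STEP_CHOICES = {
    "preprocess": [red_channel],
    "ocr": [tesseract_ocr],          # add alternative models here
    "classify": [classify_level2],
}

SLA_SECONDS = 1.0  # placeholder; use the SLA from the pipeline schematic
BATCH_SIZE = 32    # fixed batch size across all configurations
batch, batch_labels = examples[:BATCH_SIZE], labels[:BATCH_SIZE]  # assumed loaded

results = []
for config in itertools.product(*STEP_CHOICES.values()):
    start = time.perf_counter()
    outputs = [run_pipeline(config, x) for x in batch]
    latency = time.perf_counter() - start
    results.append({
        "config": [step.__name__ for step in config],
        "accuracy": sum(o == y for o, y in zip(outputs, batch_labels)) / BATCH_SIZE,
        "latency_s": latency,
        "sla_met": latency <= SLA_SECONDS,
    })
```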
- Submit your notebook, containing the code for your pipeline configurations, to GitHub. Provide a Dockerfile for a container that includes the notebook on your team GitHub repository and share it with us for grading.
- Write and submit a short report that describes the pipeline objective and architecture. The architecture description should cover the datasets, models, and assertions used, and the resulting pipeline configurations.
- State how many dataset and model choices were used and how many pipeline configurations are possible as a result.
- For each pipeline configuration, report, as a table, the accuracy (in terms of the number of assertions passed), the latency, and whether the SLA was met.
- Ensure that your notebook is well documented, starting with the name of the pipeline used, and that the report is detailed and understandable.
- Briefly describe in your report how your team organizes itself. What communication channels do you use? How have you divided the work? Did you encounter any teamwork problems, and what steps are you planning to take in future milestones to avoid them?
This milestone is worth 10 points: 3 points for producing the notebook, 3 points for sharing a Dockerfile that can be downloaded, built into a runnable container, and used to regenerate the results reported in the document, and 4 points for a report that is clear, complete, and precise.
Milestone 2: The primary task in this milestone is to test specific changes to pipeline configurations and to determine and analyze how those changes affect accuracy and performance measures. You will test model versions, measure resource consumption, and compare it against accuracy.
T1. Break the observed latency of a pipeline configuration down into per-step latencies. Identify which model inference step is the most time-consuming (see the first sketch after this task list).
T2. For that step and its model, try out different model versions on different numbers of CPU cores. Given your step SLA, note which combination of version and core count best meets it. Try at least three different model versions and three different CPU core counts. Keep the batch size fixed.
T3. For the same step, with a fixed model version and CPU core count, experiment with batching multiple requests and observe whether this changes the utilization of the assigned resources. Try at least three different batch sizes (the second sketch after this list combines T2 and T3 into one sweep).
T4. Repeat the same set of experiments for another step of the pipeline.
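One way to approach T1 is to time each step separately on a fixed batch, as in this sketch (it reuses `PIPELINE` and the fixed `batch` from the Milestone 1 sketches):

```python
import time

def per_step_latency(pipeline, batch):
    """Time each step separately on a fixed batch of inputs."""
    timings = {}
    batch_out = batch
    for step in pipeline:
        start = time.perf_counter()
        batch_out = [step(x) for x in batch_out]
        timings[step.__name__] = time.perf_counter() - start
    return timings

timings = per_step_latency(PIPELINE, batch)
slowest = max(timings, key=timings.get)
print(timings, "-> slowest step:", slowest)
```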
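For T2 and T3, a single sweep over model versions, CPU core counts, and batch sizes can collect all the needed measurements. The sketch below assumes PyTorch-based models, so the thread count can be limited with `torch.set_num_threads`; `model_v1` through `model_v3` and the step SLA are placeholders for your own variants.

```python
import itertools
import time
import torch

MODEL_VERSIONS = {"v1": model_v1, "v2": model_v2, "v3": model_v3}  # hypothetical
CORE_COUNTS = [1, 2, 4]
BATCH_SIZES = [8, 16, 32]
STEP_SLA_S = 0.5  # placeholder; derive your step SLA from the schematic

records = []
for (version, model), cores, size in itertools.product(
        MODEL_VERSIONS.items(), CORE_COUNTS, BATCH_SIZES):
    torch.set_num_threads(cores)  # PyTorch-specific; other frameworks have
                                  # analogous knobs (e.g., OMP_NUM_THREADS)
    sweep_batch = examples[:size]  # assumed loaded earlier
    start = time.perf_counter()
    outputs = [model(x) for x in sweep_batch]
    latency = time.perf_counter() - start
    records.append({
        "version": version,
        "cores": cores,
        "batch": size,
        "latency_s": latency,
        "throughput": size / latency,
        "sla_met": latency / size <= STEP_SLA_S,  # per-request latency vs. SLA
        # also record accuracy here, using your step's assertion
    })
```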
- Submit notebook code with the specific variants. The grader should be able to regenerate the data in your tables and plots.
- Create a plot comparing model version, accuracy, and latency, and another plot comparing model version, accuracy, and throughput (see the plotting sketch after this list).
- Provide a table that reports variant, batch size, latency, cost, and accuracy across the different steps of your pipeline.
- Report your analysis: which model version you would choose, at what batch size, and with how many resources. Account for the overall SLA and accuracy measures in this analysis.
- Project presentation: you should be able to present your work up to and including Milestone 2 as part of the class presentation.
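For the plot deliverable, a sketch along these lines could work, assuming the `records` collected in the sweep above have been extended with an `accuracy` field:

```python
import matplotlib.pyplot as plt

# Illustration only: accuracy vs. latency and accuracy vs. throughput,
# one series per model version.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
for version in MODEL_VERSIONS:
    rows = [r for r in records if r["version"] == version]
    ax1.scatter([r["latency_s"] for r in rows],
                [r["accuracy"] for r in rows], label=version)
    ax2.scatter([r["throughput"] for r in rows],
                [r["accuracy"] for r in rows], label=version)
ax1.set_xlabel("latency (s)")
ax1.set_ylabel("accuracy")
ax2.set_xlabel("throughput (req/s)")
ax2.set_ylabel("accuracy")
ax1.legend()
ax2.legend()
fig.savefig("model_version_tradeoffs.png")
```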
This milestone is worth 10 points: 3 points for producing the notebook, 3 points for plot and table generation, and 4 points for a report and presentation that describe the analysis and are clear, complete, and precise.
Milestone 3: In this milestone you will set up a realistic pipeline. The goal is to use available open-source libraries and ML systems to design an ML system pipeline.
T1. Set up a model loader through which the different models of each step, and their individual variants, can be stored and loaded on demand. You may use MinIO object storage. Models must be stored and loaded in container form (see the first sketch after this task list).
T2. Set up each model container on a cloud node by itself. Deploy the containers using Kubernetes (see the second sketch after this list).
T3. Equip the pipeline with a monitoring system such as Prometheus (see the third sketch after this list).
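For T1, a minimal model-loader sketch using the MinIO Python client might look as follows. The endpoint and credentials are placeholders, and the stored objects could be, for example, exported container images or serialized weight files:

```python
from minio import Minio

# Connect to a MinIO server (endpoint and credentials are placeholders).
client = Minio("localhost:9000", access_key="minioadmin",
               secret_key="minioadmin", secure=False)

BUCKET = "models"
if not client.bucket_exists(BUCKET):
    client.make_bucket(BUCKET)

def store_model(step, version, path):
    """Upload a serialized model variant under a step/version key."""
    client.fput_object(BUCKET, f"{step}/{version}", path)

def load_model(step, version, dest):
    """Download a model variant on demand."""
    client.fget_object(BUCKET, f"{step}/{version}", dest)
    return dest
```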
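For T2, the deployment can be scripted with the official Python Kubernetes client. This sketch creates one Deployment per model container; the image name and port are hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()  # assumes kubectl is configured for your cluster

def deploy_model(step_name, image):
    """Create one Kubernetes Deployment per model container."""
    deployment = client.V1Deployment(
        api_version="apps/v1",
        kind="Deployment",
        metadata=client.V1ObjectMeta(name=step_name),
        spec=client.V1DeploymentSpec(
            replicas=1,
            selector=client.V1LabelSelector(match_labels={"app": step_name}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": step_name}),
                spec=client.V1PodSpec(containers=[
                    client.V1Container(
                        name=step_name,
                        image=image,  # hypothetical image in your registry
                        ports=[client.V1ContainerPort(container_port=8080)],
                    ),
                ]),
            ),
        ),
    )
    client.AppsV1Api().create_namespaced_deployment(
        namespace="default", body=deployment)

deploy_model("ocr-step", "registry.example.com/ocr:v2")  # hypothetical image
```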
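For T3, each model-serving container can expose metrics for Prometheus to scrape. This sketch uses the `prometheus_client` library to export a request counter and a per-step latency histogram; you would additionally point Prometheus's scrape configuration at each container's metrics port:

```python
from prometheus_client import Counter, Histogram, start_http_server

# Metrics exported by a model-serving container; Prometheus scrapes them
# from the HTTP endpoint that start_http_server exposes.
REQUESTS = Counter("pipeline_requests_total",
                   "Requests handled per pipeline step", ["step"])
LATENCY = Histogram("pipeline_step_latency_seconds",
                    "Per-step inference latency", ["step"])

def handle_request(step_name, model, payload):
    """Wrap one inference call with monitoring."""
    REQUESTS.labels(step=step_name).inc()
    with LATENCY.labels(step=step_name).time():  # records duration on exit
        return model(payload)

start_http_server(8000)  # serve metrics on http://<host>:8000/metrics
```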
- Submit the modified Dockerfile, which now includes the operational parts of the pipeline.
- Submit a report describing in detail how the pipeline is operationalized, including any additional systems you use beyond those specified.