ML Pipeline: Structuring Data Analysis Projects
In the 21st century, the era of “rock stars” in software development has come to an end. Now almost every good project is backed by a team of developers relying on best practices that have evolved over years of growth in the IT sector. These practices, such as using a single repository, keeping the codebase under version control, covering code with tests, and defining code style conventions, help good engineers collaborate effectively and create high-quality products.
Unfortunately, it is not so easy to adopt the best practices of software development for data analysis. To do so, we need to find tools and approaches that take into account the special aspects of ML projects: working with large datasets, numerous pipelines, and a huge number of models with many hyperparameters.
I will tell you how to facilitate collaboration between data scientists, speed up and standardize the process of conducting experiments, and achieve reproducibility of their results.
After my talk, you will be able to:
Create a well-structured data analysis project;
Control the quality of the code inside this project;
Track the results of experiments conducted on different machines;
Automate the selection of hyperparameters;
Version data and pipelines;
Reliably reproduce experiments.
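As a small taste of the hyperparameter-automation and reproducibility points above, here is a minimal sketch of an exhaustive grid search using only the Python standard library. The `evaluate` function is a hypothetical stand-in for a real model-training run; the fixed seed makes the toy experiment reproducible.

```python
import itertools
import random

# Hypothetical scoring function standing in for a real model evaluation.
def evaluate(learning_rate, n_estimators, seed=42):
    random.seed(seed)  # fix the seed so the "experiment" is reproducible
    noise = random.uniform(-0.01, 0.01)
    # Toy objective that peaks at learning_rate=0.1, n_estimators=200.
    return 1.0 - abs(learning_rate - 0.1) - abs(n_estimators - 200) / 1000 + noise

# The hyperparameter grid: every combination will be tried.
grid = {
    "learning_rate": [0.01, 0.1, 0.3],
    "n_estimators": [100, 200, 500],
}

best_score, best_params = float("-inf"), None
for values in itertools.product(*grid.values()):
    params = dict(zip(grid.keys(), values))
    score = evaluate(**params)
    if score > best_score:
        best_score, best_params = score, params

print(best_params)
```

In a real project you would replace this loop with a dedicated tool (a hyperparameter-optimization library and an experiment tracker), but the principle is the same: parameters are enumerated programmatically, and every run is deterministic given its configuration and seed.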
St. Petersburg, Russia
I earned a master’s degree in Data Analysis from the Higher School of Economics in 2018. I worked as a freelance Python developer and data scientist for 1.5 years. Now I am a data scientist at SEMrush. I have some experience in teaching and public speaking: I taught machine learning courses at Digital Banana and gave a talk at the international conference Big Data Days 2019. I am really interested in the problem of reproducibility of ML experiments. In my free time I like to contribute to open source projects.