Kedro Review
Over the last few months I have been experimenting with Kedro. In this post I would like to share my personal experience with it: the things that worked well, and the things that worked less well. I will not cover Kedro fundamentals, since these are already covered in the official documentation and in hundreds of other posts, so I suggest checking those out first if you have not already.
Positive Points
Pipelines, nodes and CLI
When using Kedro, you are expected to stick to the pipeline and node APIs. This is a very good thing, because it encourages you to write more standardized, reusable and testable functions. A node is basically a wrapper around a function: the node API requires the function to take some inputs and return some outputs.
A Kedro pipeline links a set of nodes together into a process; each node of the pipeline represents one step in that process.
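To make this concrete, here is a minimal sketch of the node and pipeline API. The function and dataset names (clean_bikes, count_bikes, bikes, ...) are hypothetical; the bikes dataset matches the catalog example shown later in this post.

from kedro.pipeline import Pipeline, node

def clean_bikes(bikes):
    # Any plain Python function can become a node.
    return bikes.dropna()

def count_bikes(clean_bikes):
    return len(clean_bikes)

# Each node declares which catalog entries it consumes and produces.
bike_pipeline = Pipeline(
    [
        node(clean_bikes, inputs="bikes", outputs="clean_bikes"),
        node(count_bikes, inputs="clean_bikes", outputs="bike_count"),
    ]
)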
Cutting your code into nodes and pipelines allows for runtime optimizations and for the automatic creation of a sleek command line interface (CLI). The runtime optimization comes from running independent nodes in parallel using Kedro's parallel runner. The CLI lets you easily select which nodes of a pipeline to run, which is extremely useful during development.
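For instance, some of the commands you get for free (exact flag spellings vary slightly between Kedro versions, and the pipeline and node names here are hypothetical):

kedro run                               # run the default pipeline
kedro run --pipeline=data_processing    # run a single pipeline
kedro run --node=clean_bikes            # run only a selected node
kedro run --runner=ParallelRunner       # run independent nodes in parallel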
This feature is the heart of Kedro and the main reason why I have decided to try this library.
Data Catalog
Example of data catalog:
# catalog.yaml
# This file abstracts away the details of how the different
# datasets should be saved and loaded.

bikes:  # In your pipeline, you will refer to this dataset as 'bikes'
  type: pandas.CSVDataSet
  filepath: data/01_raw/bikes.csv

boats:
  type: pandas.CSVDataSet
  filepath: data/01_raw/company/boats.csv.gz
  load_args:
    sep: ','
    compression: gzip
  fs_args:
    open_args_load:
      mode: rb
Kedro offers a very interesting framework for managing data via the Data Catalog. In the catalog we specify all of the data sources used by the project, including outputs and intermediate data. The Data Catalog abstracts away details like data locations and formats, so that in your code you can focus on what matters.
The Data Catalog seamlessly integrates with Kedro pipelines and creates a nice separation of concerns between storage (the catalog) and business logic (pipeline and nodes).
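The catalog can also be used on its own, for example in a notebook. A minimal sketch, assuming the catalog above is saved at the conventional conf/base/catalog.yml path:

import yaml
from kedro.io import DataCatalog

with open("conf/base/catalog.yml") as f:
    conf_catalog = yaml.safe_load(f)

# Build a catalog from the YAML config and load a dataset by name;
# where the CSV lives and how it is parsed stays in the catalog.
catalog = DataCatalog.from_config(conf_catalog)
bikes = catalog.load("bikes")  # a pandas DataFrame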
I think that the data catalog is one of the killer features of Kedro.
The Data Engineering Convention
data
├── 01_raw <-- Raw immutable data
├── 02_intermediate <-- Typed data
├── 03_primary <-- Domain model data
├── 04_feature <-- Model features
├── 05_model_input <-- Often called 'master tables'
├── 06_models <-- Serialised models
├── 07_model_output <-- Data generated by model runs
└── 08_reporting <-- Ad hoc descriptive cuts
The Data Engineering Convention is a way of organizing your datasets into different folders. It is not necessarily Kedro-specific, but it is something I learned thanks to this project. I found it very useful, and I will definitely use it in other projects, even ones that do not use Kedro.
Pipeline Visualizations
It is always rewarding to show off something you have built to your colleagues and business stakeholders, and kedro-viz is a Kedro plug-in that lets you do exactly that (and much more). While visualizing a pipeline, you can interact with the pipeline graph and inspect the inputs and outputs of each node. This is very helpful for documenting your code and for understanding what your colleagues' pipelines are doing.
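Getting it up and running is essentially a one-liner, assuming the plug-in is installed (in newer versions the command may be kedro viz run):

pip install kedro-viz
kedro viz  # opens an interactive pipeline graph in the browser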
Negative Points
Overly broad scope
Comparing Kedro with other Python frameworks I have worked with outside of the data science domain, I cannot ignore how broad this library's scope is. Reading the documentation, I see that Kedro wants to impose its standards on its users. This comes in the form of opinions on how to structure pipelines, manage data, lay out the repository, organize data folders, and manage dependencies.
In my humble opinion, this is a bit too much for a single package. I understand that these functionalities are, strictly speaking, optional; however, that is not the tone I get from the official documentation.
I am sure that the people at QuantumBlack had noble intentions when they defined the scope of Kedro. After all, such a broad scope has the advantage of standardizing every aspect of model development. Moreover, it makes life easier for data scientists without an IT background, since everything is already decided and documented for them.
In my opinion, this broad scope is also the root cause of the other negative points below.
Repository Structure
I am not a huge fan of the design decisions taken in Kedro when it comes to the recommended repository structure. The suggested structure feels a bit too nested and not very Pythonic. For example, why are the project requirements not at the top level of the repository, rather than nested inside the src folder?
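For reference, this is roughly the layout that kedro new generated at the time of writing (simplified; note where requirements.txt lives):

my-project
├── conf            <-- Project configuration (catalog, parameters, credentials)
├── data            <-- The data layers shown above
├── docs
├── logs
├── notebooks
└── src
    ├── my_project          <-- The actual Python package (nodes and pipelines)
    ├── tests
    ├── requirements.txt    <-- Project requirements, nested inside src
    └── setup.py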
I do understand that this topic is very subjective and use-case specific. Moreover, if you do not like the default repository structure, you can always change it.
Another negative point for Kedro is that the command to create a new project is a Kedro command (kedro new), even though Kedro uses cookiecutter under the hood. I would have preferred the documentation to recommend a plain cookiecutter command for creating the project; that way I would have had access to all of cookiecutter's command-line options.
Dependency Management
The Kedro command for locking requirements is kedro build-reqs, which calls pip-tools under the hood. I would have preferred the documentation to explain how to use Kedro directly with pip-tools, and possibly with other dependency management tools like pip or poetry.
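For the record, my understanding is that kedro build-reqs boils down to something roughly equivalent to this pip-tools invocation (paths may differ between Kedro versions):

pip-compile src/requirements.in --output-file src/requirements.txt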
Adoption
Kedro is a relatively young package compared to libraries like scikit-learn and pandas. While those two have become pretty much the gold standard for data science, and practically everyone in the sector knows them, the adoption of Kedro is still quite limited. This means it is harder for newcomers to find solutions to common problems on Stack Overflow.
This is not too big of a deal, since the community behind Kedro is very active and supportive. Moreover, the Kedro documentation is quite extensive and detailed.
Conclusions
I really enjoyed working with Kedro. I had to use some quirky hacks to adapt it to my use case, but in the end I was quite satisfied with the solution. I genuinely believe that the benefits outweigh the negative aspects. It is a great package that can help newcomers further professionalize their code, and more seasoned professionals automate and standardize parts of their job. I will definitely keep using it in my next projects.