scikit-learn source code review guide

March 17, 2024

some words ahead - cython basics

cython file types

.pyx: the source file that contains the implementation, analogous to a .py file.

.pxi: an include file that can hold both implementations and declarations; its use is now discouraged in favor of .pxd files.

.pxd: a declaration file. When it accompanies an equally named .pyx / .py file, it provides a Cython interface to that module so that other Cython modules can communicate with it using a more efficient protocol than the Python one. It may, however, contain implementations in the form of inline functions. Example:

cdef inline int int_min(int a, int b):
    return b if b < a else a

Difference between .pxi and .pxd:

LONG Answer: A .pxd file is a declaration file, and is used to declare classes, methods, etc. in a C extension module, (typically as implemented in a .pyx file of the same name). It can contain declarations only, i.e. no executable statements. One can cimport things from .pxd files just as one would import things in Python. Two separate modules cimporting from the same .pxd file will receive identical objects.

A .pxi file is an include file and is textually included (similar to the C #include directive) and may contain any valid Cython code at the given point in the program. It may contain implementations (e.g. common cdef inline functions) which will be copied into both files. For example, this means that if I have a class A declared in a.pxi, and both b.pyx and c.pyx do include a.pxi then I will have two distinct classes b.A and c.A. Interfaces to C libraries (including the Python/C API) have usually been declared in .pxi files (as they are not associated to a specific module). It is also re-parsed at every invocation.

Now that cimport * can be used, there is no reason to use .pxi files for external declarations.
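
The sharing difference can be mimicked in plain Python. The sketch below is only an analogy, not Cython: executing the same source text into two namespaces stands in for include "a.pxi", and a single shared module object stands in for cimport. The names SOURCE, b_namespace, c_namespace and shared are made up for the illustration.

```python
import types

# Stand-in for the contents of a.pxi: plain source text defining a class A.
SOURCE = "class A:\n    pass\n"

# "include": the text is pasted into each module separately,
# so each namespace ends up with its own class object.
b_namespace = {}
c_namespace = {}
exec(SOURCE, b_namespace)
exec(SOURCE, c_namespace)
print(b_namespace["A"] is c_namespace["A"])  # False: two distinct classes

# "cimport"-like sharing: one module object is created once and reused.
shared = types.ModuleType("a")
exec(SOURCE, shared.__dict__)
b_view = shared.A
c_view = shared.A
print(b_view is c_view)  # True: the same class object
```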

.pyd, .so: the compiled output of C or C++ extension modules for Python (.pyd on Windows, .so on Linux and macOS).

from setuptools import setup, Extension  # distutils was removed in Python 3.12

# Describe one C extension module named "demo", built from demo.c.
module1 = Extension('demo',
                    sources=['demo.c'])

setup(name='PackageName',
      version='1.0',
      description='This is a demo package',
      ext_modules=[module1])

With this setup.py and a file demo.c, running python setup.py build will compile demo.c and produce an extension module named demo in the build directory. Depending on the system, the module file will end up in a subdirectory build/lib.system, and may have a name like demo.so or demo.pyd.
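
To see which file names the running interpreter would accept for such a module, the standard library exposes the list of extension suffixes. This is a small sketch; the module name demo comes from the example above, and the exact suffixes printed depend on the platform and Python version.

```python
import importlib.machinery

# Suffixes the running interpreter accepts for compiled extension modules,
# e.g. something ending in ".so" on Linux/macOS or ".pyd" on Windows.
suffixes = importlib.machinery.EXTENSION_SUFFIXES
print(suffixes)

# File names under which an extension module called "demo" could be imported.
candidate_names = ["demo" + suffix for suffix in suffixes]
print(candidate_names)
```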


additional reading

cython keywords def and cdef

The key difference between def and cdef is where the function can be called from: def functions can be called from Python and Cython, while cdef functions can be called from Cython and C.

cpdef function: cpdef functions cause Cython to generate a cdef function (which allows a quick function call from Cython) and a def function (which allows you to call it from Python). Internally the def function just calls the cdef function. In terms of the types of arguments allowed, cpdef functions have all the restrictions of both cdef and def functions.


Similar to CMake, Meson does not build software directly but delegates to an appropriate backend: ninja on GNU/Linux, Visual Studio on Windows, and Xcode on macOS.


# scikit-learn/sklearn/ensemble/
py.extension_module(
  '_gradient_boosting',
  ['_gradient_boosting.pyx'] + utils_cython_tree,
  dependencies: [np_dep],
  cython_args: cython_args,
  subdir: 'sklearn/ensemble',
  install: true
)


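subdir() tells the build system which directory to enter next and build. It is not the same thing as the subdir: keyword argument in the snippet above, which names the directory the built extension is installed into.

additional reading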

oop concept mixin

Mixins are heavily used in scikit-learn.

A mixin is a special kind of multiple inheritance. There are two main situations where mixins are used:

  1. You want to provide a lot of optional features for a class.
  2. You want to use one particular feature in a lot of different classes.

Mixins were introduced to tame the chaos that comes with multiple inheritance.
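
A minimal sketch of the second situation (one feature reused by many classes). JsonMixin, Point and User are hypothetical names invented for this example; nothing here is taken from scikit-learn.

```python
import json


class JsonMixin:
    """Adds a to_json() feature to any class that keeps its state in __dict__."""

    def to_json(self):
        return json.dumps(self.__dict__, sort_keys=True)


class Point(JsonMixin):
    def __init__(self, x, y):
        self.x = x
        self.y = y


class User(JsonMixin):
    def __init__(self, name):
        self.name = name


# Two unrelated classes get the same feature without sharing a "real" parent.
print(Point(1, 2).to_json())  # {"x": 1, "y": 2}
print(User("ada").to_json())  # {"name": "ada"}
```

The mixin holds no state of its own and is never instantiated directly, which is what keeps this kind of multiple inheritance manageable.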


general project structure

base classes

The sklearn project can be seen as a big tree, with the various estimators as its fruits; the trunk that supports these estimators is a small set of base classes. Common ones include BaseEstimator, BaseSGD, ClassifierMixin, RegressorMixin, etc.


The lowest level is the BaseEstimator class. It mainly exposes two methods: set_params and get_params.
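
A rough sketch of how such a pair of methods can work, assuming (as scikit-learn's convention does) that every constructor argument is stored on the instance under the same name. ParamsBase and ToyEstimator are hypothetical stand-ins, not scikit-learn's code, and features of the real methods such as nested parameters are left out.

```python
import inspect


class ParamsBase:
    """Hypothetical stand-in for BaseEstimator; not scikit-learn's code."""

    def get_params(self):
        # Read parameter names from the __init__ signature, then look up
        # the attribute of the same name on the instance.
        signature = inspect.signature(type(self).__init__)
        names = [name for name in signature.parameters if name != "self"]
        return {name: getattr(self, name) for name in names}

    def set_params(self, **params):
        valid = self.get_params()
        for name, value in params.items():
            if name not in valid:
                raise ValueError(f"Invalid parameter {name!r}")
            setattr(self, name, value)
        return self  # returning self allows chained calls


class ToyEstimator(ParamsBase):
    def __init__(self, alpha=1.0, max_iter=100):
        self.alpha = alpha
        self.max_iter = max_iter


est = ToyEstimator(alpha=0.5)
print(est.get_params())  # {'alpha': 0.5, 'max_iter': 100}
print(est.set_params(max_iter=10).get_params())  # {'alpha': 0.5, 'max_iter': 10}
```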


Mixin means mixing in a class, which can be simply understood as adding some extra methods to other classes. Sklearn's classification and regression mixin classes only implement score; any class that inherits from them has to implement the other methods, such as fit and predict, itself.
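
The division of labor can be sketched in a few lines. AccuracyScoreMixin and MajorityClassifier are hypothetical classes written for this note; the mixin only mirrors the idea that score for a classifier is the mean accuracy of predict.

```python
class AccuracyScoreMixin:
    """Hypothetical mixin: provides score() and relies on the host's predict()."""

    def score(self, X, y):
        predictions = self.predict(X)
        correct = sum(1 for p, t in zip(predictions, y) if p == t)
        return correct / len(y)


class MajorityClassifier(AccuracyScoreMixin):
    """Toy classifier that always predicts the most frequent training label."""

    def fit(self, X, y):
        labels = list(y)
        self.majority_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


X = [[0], [1], [2], [3]]
y = ["a", "a", "a", "b"]
clf = MajorityClassifier().fit(X, y)
print(clf.score(X, y))  # 0.75
```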


tree and ensemble

tree cython & python

_tree.pyx holds the core structure of the tree model. It is responsible for building the tree and is encapsulated by the Cython class Tree. Tree nodes can be generated by two builders, the DepthFirstTreeBuilder class and the BestFirstTreeBuilder class. Both inherit from the TreeBuilder class, mainly to reuse its _check_input method, which checks the input type and converts the data into a contiguous memory layout (for example with np.asfortranarray and np.ascontiguousarray) in order to speed up computation and indexing.
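
The two builders differ mainly in the order in which nodes get expanded. The sketch below is a toy illustration of that idea only, assuming depth-first expansion is driven by a stack and best-first expansion by a priority queue on impurity improvement. NODES, its improvement values and max_expansions are all invented for the example and do not come from the scikit-learn source.

```python
import heapq

# Hypothetical candidate splits: name -> (impurity improvement, child names).
NODES = {
    "root": (0.9, ["L", "R"]),
    "L": (0.2, ["LL", "LR"]),
    "R": (0.7, ["RL", "RR"]),
    "LL": (0.1, []),
    "LR": (0.05, []),
    "RL": (0.3, []),
    "RR": (0.6, []),
}


def depth_first_order(root):
    """Expand nodes with a LIFO stack, finishing one branch before the next."""
    order, stack = [], [root]
    while stack:
        name = stack.pop()
        order.append(name)
        # Push the right child first so the left child is expanded first.
        stack.extend(reversed(NODES[name][1]))
    return order


def best_first_order(root, max_expansions):
    """Always expand the frontier node with the largest improvement."""
    order = []
    frontier = [(-NODES[root][0], root)]  # negate: heapq is a min-heap
    while frontier and len(order) < max_expansions:
        _, name = heapq.heappop(frontier)
        order.append(name)
        for child in NODES[name][1]:
            heapq.heappush(frontier, (-NODES[child][0], child))
    return order


print(depth_first_order("root"))
# ['root', 'L', 'LL', 'LR', 'R', 'RL', 'RR']
print(best_first_order("root", max_expansions=4))
# ['root', 'R', 'RR', 'RL']
```

With a budget on the number of expansions, best-first spends it where the improvement is largest, whereas depth-first simply follows the branch order.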

sklearn/tree/ & sklearn/: BaseDecisionTree inherits from the BaseEstimator base class, which provides the basic interface shared by all estimators, such as get_params. The ClassifierMixin / RegressorMixin base classes provide the score interface.

ada boost

The AdaBoost implementation exposes only two public estimators, AdaBoostClassifier and AdaBoostRegressor. Both extend basic AdaBoost so that it can handle multi-class classification and regression, and both inherit from BaseWeightBoosting and BaseEnsemble. It is worth noting that AdaBoost requires each base learner to do better than random guessing, i.e. an accuracy greater than 50% in the two-class case. If this cannot be guaranteed, the theoretical derivation no longer holds, so the source code enforces the limit directly.
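
A small sketch of what such a check looks like, using the multi-class SAMME weight formula, where a learner is rejected once its weighted error reaches the chance level 1 - 1/K for K classes. The function name samme_estimator_weight and the decision to return None are choices made for this note, not the library's API, and the zero-error case is deliberately not handled.

```python
import math


def samme_estimator_weight(error, n_classes, learning_rate=1.0):
    """Return the stage weight, or None if the learner is no better than chance."""
    if not 0.0 < error < 1.0:
        # Simplification of this sketch: edge cases are not modeled.
        raise ValueError("this sketch only handles 0 < error < 1")
    chance_error = 1.0 - 1.0 / n_classes  # 0.5 for two classes
    if error >= chance_error:
        return None  # reject: the base learner is not better than guessing
    return learning_rate * (
        math.log((1.0 - error) / error) + math.log(n_classes - 1.0)
    )


print(samme_estimator_weight(0.25, n_classes=2))  # log(3), about 1.0986
print(samme_estimator_weight(0.5, n_classes=2))   # None
print(samme_estimator_weight(0.6, n_classes=3))   # positive, since 0.6 < 2/3
```

For two classes the log(K - 1) term vanishes, and the weight is positive exactly when the error is below 50%.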


The basic calling sequence is fit() -> _fit_stages() -> _fit_stage(). The interfaces are shown below with details omitted:

def fit(self, X, y, sample_weight=None, monitor=None):
    """Fit the gradient boosting model.

    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples. Internally, it will be converted to
        ``dtype=np.float32`` and if a sparse matrix is provided
        to a sparse ``csr_matrix``.

    y : array-like of shape (n_samples,)
        Target values (strings or integers in classification, real numbers
        in regression)
        For classification, labels must correspond to classes.

    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights. If None, then samples are equally weighted. Splits
        that would create child nodes with net zero or negative weight are
        ignored while searching for a split in each node. In the case of
        classification, splits are also ignored if they would result in any
        single class carrying a negative weight in either child node.

    monitor : callable, default=None
        The monitor is called after each iteration with the current
        iteration, a reference to the estimator and the local variables of
        ``_fit_stages`` as keyword arguments ``callable(i, self,
        locals())``. If the callable returns ``True`` the fitting procedure
        is stopped. The monitor can be used for various things such as
        computing held-out estimates, early stopping, model introspect, and
        snapshotting.

    Returns
    -------
    self : object
        Fitted estimator.
    """

def _fit_stages(self, ...):  # remaining parameters omitted
    """Iteratively fits the stages.

    For each stage it computes the progress (OOB, train score)
    and delegates to ``_fit_stage``.
    Returns the number of stages fit; might differ from ``n_estimators``
    due to early stopping.
    """
    for i in range(begin_at_stage, self.n_estimators):
        # fit next stage of trees
        raw_predictions = self._fit_stage(...)  # arguments omitted

def _fit_stage(self, ...):  # remaining parameters omitted
    """Fit another stage of ``n_trees_per_iteration_`` trees."""

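To make the fit() -> _fit_stages() -> _fit_stage() chain concrete, here is a toy, runnable imitation of its shape. ToyBooster is a hypothetical class written for this note: it boosts a constant weak learner on squared loss, takes no feature matrix at all, and only borrows the method names and the monitor convention from the interfaces above.

```python
class ToyBooster:
    """Hypothetical least-squares booster whose weak learner is a constant."""

    def __init__(self, n_estimators=50, learning_rate=0.5):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate

    def fit(self, y, monitor=None):
        self.stages_ = []
        raw_predictions = [0.0] * len(y)
        # May end up smaller than n_estimators if the monitor stops early.
        self.n_estimators_ = self._fit_stages(y, raw_predictions, monitor)
        return self

    def _fit_stages(self, y, raw_predictions, monitor, begin_at_stage=0):
        n_fitted = begin_at_stage
        for i in range(begin_at_stage, self.n_estimators):
            # fit next stage
            raw_predictions = self._fit_stage(y, raw_predictions)
            n_fitted = i + 1
            if monitor is not None and monitor(i, self, locals()):
                break  # the monitor asked for early stopping
        return n_fitted

    def _fit_stage(self, y, raw_predictions):
        # Residuals are the negative gradient of the squared loss.
        residuals = [t - p for t, p in zip(y, raw_predictions)]
        step = sum(residuals) / len(residuals)  # the constant weak learner
        self.stages_.append(step)
        return [p + self.learning_rate * step for p in raw_predictions]

    def predict(self):
        return self.learning_rate * sum(self.stages_)


y = [1.0, 2.0, 3.0]
model = ToyBooster(n_estimators=50).fit(y)
print(model.n_estimators_)  # 50

stopped = ToyBooster(n_estimators=50).fit(
    y, monitor=lambda i, est, local_vars: i >= 4
)
print(stopped.n_estimators_)  # 5
```

Each stage halves the remaining gap to the mean of y, so the full run predicts a value very close to 2.0, while the monitored run stops after five stages.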

