scikit-learn source code review guide


  © 2024 borui. All rights reserved.

  This content may be freely reproduced, displayed, modified, or distributed with proper attribution to borui and a link to the article: 
  
  borui(2024-03-17 03:07:26 +0000). scikit-learn source code review guide. https://borui/blog/2024-03-17-en-sklearn-source-code-review.

@misc{
  borui2024,
  author = {borui},
  title = {scikit-learn source code review guide},
  year = {2024},
  publisher = {borui's blog},
  journal = {borui's blog},
  url={https://borui/blog/2024-03-17-en-sklearn-source-code-review}
}

some words ahead - cython basics

cython file types

.pyx: it is the source file contains implementation, same as .py.

.pxi: it can be used for implementation and declaration, now deprecated.

.pxd: it is used for decalaration. When accompanying an equally named .pyx / .py file, they provide a Cython interface to the Cython module so that other Cython modules can communicate with it using a more efficient protocol than the Python one. However, it may include implementation as inline. Example:

cdef inline int int_min(int a, int b):
    return b if b < a else a

Difference between .pxi and .pxd:

LONG Answer: A .pxd file is a declaration file, and is used to declare classes, methods, etc. in a C extension module, (typically as implemented in a .pyx file of the same name). It can contain declarations only, i.e. no executable statements. One can cimport things from .pxd files just as one would import things in Python. Two separate modules cimporting from the same .pxd file will receive identical objects.

A .pxi file is an include file and is textually included (similar to the C #include directive) and may contain any valid Cython code at the given point in the program. It may contain implementations (e.g. common cdef inline functions) which will be copied into both files. For example, this means that if I have a class A declared in a.pxi, and both b.pyx and c.pyx do include a.pxi then I will have two distinct classes b.A and c.A. Interfaces to C libraries (including the Python/C API) have usually been declared in .pxi files (as they are not associated to a specific module). It is also re-parsed at every invocation.

Now that cimport * can be used, there is no reason to use .pxi files for external declarations.

.pyd, .so : Output from compilation of c or cpp extensions in python.

# setup.py
from distutils.core import setup, Extension

module1 = Extension('demo',
                    sources = ['demo.c'])

setup (name = 'PackageName',
       version = '1.0',
       description = 'This is a demo package',
       ext_modules = [module1])

With this setup.py, and a file demo.c, running python setup.py build will compile demo.c, and produce an extension module named demo in the build directory. Depending on the system, the module file will end up in a subdirectory build/lib.system, and may have a name like demo.so or demo.pyd.

reference

Cython 3.1.0a0 documentation » Tutorials » pxd files (n.d.). cython documentation. Retrieved April 3, 2024, from https://cython.readthedocs.io/en/latest/src/tutorial/pxd_files.html
Cython 3.1.0a0 documentation » Users Guide » FAQ (n.d.). cython documentation. Retrieved April 3, 2024, from https://cython.readthedocs.io/en/latest/src/userguide/faq.html#what-is-the-difference-between-a-pxd-and-pxi-file-when-should-either-be-used https://cython.readthedocs.io/en/latest/src/userguide/faq.html#what-is-better-a-single-big-module-or-multiple-separate-modules
3.10.13 Documentation » Extending and Embedding the Python Interpreter » 4. Building C and C++ Extensions(n.d.). python documentation. Retrieved April 3, 2024, from https://docs.python.org/3.10/extending/building.html#building-c-and-c-extensions-with-distutils

additional reading

rrr2. (31 Mar, 2019). Cython 的 .pyx .pyd 文件格式. CSDN. [Blog post]. Retrieved April 3, 2024, from https://blog.csdn.net/qq_35608277/article/details/88937904

cython keywords def and cdef

The key difference between def and cdef is in where the function can be called from: def functions can be called from Python and Cython while cdef function can be called from Cython and C.

cpdef function: cpdef functions cause Cython to generate a cdef function (that allows a quick function call from Cython) and a def function (which allows you to call it from Python). Interally the def function just calls the cdef function. In terms of the types of arguments allowed, cpdef functions have all the restrictions of both cdef and def functions.

reference

DavidW. (Feb 1, 2017). Restrictions on cdef functions: \ncdef functions cannot be defined inside other functions - this is because there is no way of storing any captured variables in a C function pointer. E.g. the following code is illegal: def g(a): \ncdef (int b): \nreturn a+b \ncdef functions cannot take *args and **kwargs type arguments. This is because they cannot easily be translated into a C signature. \nAdvantages of cdef functions \ncdef functions can take any type of argument, including those that have no Python equivalent (for example pointers). def functions cannot have these, since they must be callable from Python. \ncdef functions can also specify a return type (if it is not specified then they return a Python object, PyObject* in C). def functions always return a \nPython object, so cannot specify a return type: \ncdef int h(int* a):\n # specify a return type and take a non-Python compatible argument\n return a[0] \ncdef functions are quicker to call than def functions because they translate to a simple C function call.. [Answer]. stackoverflow. https://stackoverflow.com/questions/28362009/definition-of-def-cdef-and-cpdef-in-cython/41976772#41976772

meson.build - Meson Build system

Similar to CMake, Meson does not build software directly, but uses an appropriate backend, using ninja on GNU/Linux, Visual Studio on Windows, and Xcode on MacOS.

Example:

# scikit-learn/sklearn/ensemble/meson.build
py.extension_module(
  '_gradient_boosting',
  ['_gradient_boosting.pyx'] + utils_cython_tree,
  dependencies: [np_dep],
  cython_args: cython_args,
  subdir: 'sklearn/ensemble',
  install: true
)

subdir('_hist_gradient_boosting')

subdir() is a hint for the building system for next directory to go inito and build

addtional reading

QQVQQ.... (24 April, 2023.). Meson构建系统的使用. CSDN. [Blog post]. Retrieved April 3, 2024, from https://blog.csdn.net/hfy1237/article/details/130261785
espresso_yu. (20 Sept, 2020). Meson构建系统（一）. CSDN. [Blog post]. Retrieved April 3, 2024, from https://blog.csdn.net/u010074726/article/details/108695256

oop concept mixin

Mxins are heavily used in scikit-learn.

A mixin is a special kind of multiple inheritance. There are two main situations where mixins are used:

You want to provide a lot of optional features for a class.
You want to use one particular feature in a lot of different classes.

The invention of Mixin is to solve the chaos came with multiple inheritance.

reference

Jason Baker. (Feb 13, 2009). A mixin is a special kind of multiple inheritance. There are two main situations where mixins are used:You want to provide a lot of optional features for a class.You want to use one particular feature in a lot of different classes.. [Answer]. stackoverflow. https://stackoverflow.com/questions/533631/what-is-a-mixin-and-why-is-it-useful/547714#547714
Wealong. (July 22, 2015). 如楼上很多答主一样，谈到Mixin就不得不谈到多重继承，因为Mixin的出现就是为了解决多重继承的问题，那么多重继承有什么问题呢？在《松本行弘的程序世界》一书中，作者列举了以下三点：结构复杂化：如果是单一继承，一个类的父类是什么，父类的父类是什么，都很明确，因为只有单一的继承关系，然而如果是多重继承的话，一个类有多个父类，这些父类又有自己的父类，那么类之间的关系就很复杂了。优先顺序模糊：假如我有A，C类同时继承了基类，B类继承了A类，然后D类又同时继承了B和C类，所以D类继承父类的方法的顺序应该是D、B、A、C还是D、B、C、A，或者是其他的顺序，很不明确。功能冲突：因为多重继承有多个父类，所以当不同的父类中有相同的方法是就会产生冲突。如果B类和C类同时又有相同的方法时，D继承的是哪个方法就不明确了，因为存在两种可能性。当然你可以说有些语言解决了这个问题，但是并不是所有语言都想要去纠结这个问题。所以为能够利用多继承的优点又解决多继承的问题，提出了规格继承和实现继承这两样东西。简单来讲，规格继承指的是一堆方法名的集合，而实现继承除了方法名还允许有方法的实现。Java 选择了规格继承，在 Java 中叫 interface（不过Java8中已经有默认方法了），而 Ruby 选择了实现继承，也可以叫Mixin，在 Ruby 中叫 module。从某种程度上来说，继承强调 I am，Mixin 强调 I can。当你 implement 了这个接口或者 include 这个 module 的时候，然后就你行你上。. [Answer]. zhihu. https://www.zhihu.com/question/20778853/answer/55983839
Anonymous User. (Mar 5, 2022). Mixin的规矩当我们用SomeMixin去命名一个类并进行合理实现时，一般意味着：不认为它是个基类（但在语言层面它就是一个可被继承的基类），因此不会在其中去写属于实际类内涵的属性和方法；纯粹的Mixin不需要依赖基类的属性或方法 —— 椰果不是一种奶茶它能独立地实现一些功能，通常只写方法和方法必须的内部变量，Mixin内部的方法间往往有相互调用的关系，形成一个联系紧密的功能集合 —— 椰果可以单独吃写实际类时继承这个Mixin，就会获得它的功能；实际类不覆盖所继承Mixin的属性和方法，因此不需要也不会调用super()函数去取Mixin中的方法 —— 倒进去了，杯子里就有椰果，椰果也还是椰果可能被同时继承的Mixin之间没有重复的方法或属性，因此不用关心继承的顺序 —— 椰果是椰果、牛奶是牛奶多重继承基类和Mixin构成我们想要的类——加好各种原料奶茶就可以喝了（注意：比喻可以帮助理解，但并不精确）. [Answer]. zhihu. https://www.zhihu.com/question/20778853/answer/2374758411

general project structure

base classes

The sklearn project can be seen as a big tree, with various estimators as fruits, and the trunks that supports these estimators are a few base classes. Several common classes include BaseEstimator, BaseSGD, ClassifierMixin, RegressorMixin, etc.

BaseEstimator

The lowest level is the BaseEstimator class. Mainly exposes two methods: set_params, get_params.

ClassifierMixin

Mixin means mixing in a class, which can be simply understood as adding some additional methods to other classes. Sklearn's classification and regression mixed classes only implement score, and any class that inherits them needs to implement other methods such as fit and predict by yourself.

reference

chandlertu. (2020, Mar 23). Scikit-Learn 源码研读（第二期）基类的实现细节 [Blog post]. Retrieved from https://www.cnblogs.com/learn-the-hard-way/p/12532888.html#baseestimator

tree and ensemble

tree cython & python

_tree.pyx is the main structure of the tree model, responsible for the level generation of the tree, and is encapsulated by the Cython class Tree. There are two methods for generating tree nodes: the DepthFirstTreeBuilder class and the BestFirstTreeBuilder class. They all inherit from the TreeBuilder class, in fact, in order to inherit its _check_input method, check the input type, and convert it into a continuous memory storage form ( the purpose is to speed up calculation and indexing ) such as np.asfortranarray and np.ascontiguousarray format.

sklearn/tree/tree.py & sklearn/base.py: BaseDecisionTree inherits from the BaseEstimator base class, which provides basic operation interfaces for all predictors, such as get_params operations. The ClassifierMixin/RegressorMixin base class provides the score interface.

ada boost

weight_boosting.py The code implementation of AdaBoost has only two interfaces, AdaBoostClassifier and AdaBoostRegressor, both of which are extensions of the basic AdaBoost, allowing it to perform multi-classification and regression; both of them inherit from BaseWeightBoosting and BaseEnsemble. It is worth noting that Adaboost needs to ensure that the base learner has an accuracy greater than 50% (two-class classification). If it cannot be guaranteed, its theoretical derivation will be wrong, so the accuracy is directly limited in the source code.

gdbt

Basic calling sequence is like: fit()->fit_stages()->fit_stage, interfaces with details omitted:

def fit(self, X, y, sample_weight=None, monitor=None):
    """Fit the gradient boosting model.    
    Parameters
    ----------
    X : {array-like, sparse matrix} of shape (n_samples, n_features)
        The input samples. Internally, it will be converted to
        ``dtype=np.float32`` and if a sparse matrix is provided
        to a sparse ``csr_matrix``.

    y : array-like of shape (n_samples,)
        Target values (strings or integers in classification, real numbers
        in regression)
        For classification, labels must correspond to classes.

    sample_weight : array-like of shape (n_samples,), default=None
        Sample weights. If None, then samples are equally weighted. Splits
        that would create child nodes with net zero or negative weight are
        ignored while searching for a split in each node. In the case of
        classification, splits are also ignored if they would result in any
        single class carrying a negative weight in either child node.

    monitor : callable, default=None
        The monitor is called after each iteration with the current
        iteration, a reference to the estimator and the local variables of
        ``_fit_stages`` as keyword arguments ``callable(i, self,
        locals())``. If the callable returns ``True`` the fitting procedure
        is stopped. The monitor can be used for various things such as
        computing held-out estimates, early stopping, model introspect, and
        snapshotting.

    Returns
    -------
    self : object
        Fitted estimator.
    """

    self.fit_stages()
    
def _fit_stages(
    self,
    X,
    y,
    raw_predictions,
    sample_weight,
    random_state,
    X_val,
    y_val,
    sample_weight_val,
    begin_at_stage=0,
    monitor=None,
):
    """Iteratively fits the stages.

    For each stage it computes the progress (OOB, train score)
    and delegates to ``_fit_stage``.
    Returns the number of stages fit; might differ from ``n_estimators``
    due to early stopping.
    """
    for i in range(begin_at_stage, self.n_estimators):
      # fit next stage of trees
      raw_predictions = self._fit_stage()

def _fit_stage(
    self,
    i,
    X,
    y,
    raw_predictions,
    sample_weight,
    sample_mask,
    random_state,
    X_csc=None,
    X_csr=None,
):
    """Fit another stage of ``n_trees_per_iteration_`` trees."""

reference

Rrui_739. (May 30, 2019). sklearn源码解读：1.10 Decision Trees & 1.11 Ensemble methods. CSDN. [Blog post]. Retrieved April 3, 2024, from https://blog.csdn.net/Rrui7739/article/details/90669812
scikit-learn/sklearn/ensemble/_gb.py - BaseGradientBoosting._fit_stage. branch main. commit b38b7f3. https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/ensemble/_gb.py