This rule raises an issue when a Scikit-Learn Pipeline is created without specifying the memory argument.
When the memory argument is not specified, the transformers are recomputed every time the pipeline is fitted. This can be
time-consuming if the transformers are expensive to compute or if the dataset is large.
However, if the intent is to recompute the transformers every time, the memory argument should be set explicitly to None. This makes the
intention clear.
To fix this issue, specify the memory argument when creating a Scikit-Learn Pipeline.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
])  # Noncompliant: the memory parameter is not provided
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
], memory="cache_folder")  # Compliant
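When recomputation is actually wanted, passing memory=None explicitly also satisfies the rule. A minimal sketch of that case:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Passing memory=None explicitly signals that recomputing the
# transformers on every fit is intentional.
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', LinearDiscriminantAnalysis())
], memory=None)  # Compliant: the intent to skip caching is explicit
```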
If the pipeline is used with different datasets, the cache may not be helpful and can consume a lot of disk space. This is particularly true
when using sklearn.model_selection.HalvingGridSearchCV or sklearn.model_selection.HalvingRandomSearchCV, because with the default
configuration the size of the dataset changes at every iteration.
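One way to avoid a cache accumulating on disk in such situations is to back the pipeline with a temporary directory. This is a sketch, not the only approach; the directory name is generated by the standard library and the cache is deleted when the context manager exits:

```python
import tempfile

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# The cache lives only for the duration of the with-block, so it
# cannot grow unbounded across runs or search iterations.
with tempfile.TemporaryDirectory() as cache_dir:
    pipeline = Pipeline([
        ('scaler', StandardScaler()),
        ('classifier', LinearDiscriminantAnalysis())
    ], memory=cache_dir)  # Compliant: memory is set explicitly
```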