4.1. Pipeline and FeatureUnion: combining estimators
4.1.1. Pipeline: chaining estimators
Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves two purposes here:
Convenience: You only have to call fit and predict once on your data to fit a whole sequence of estimators.
Joint parameter selection: You can grid search over parameters of all estimators in the pipeline at once.
All estimators in a pipeline, except the last one, must be transformers (i.e. must have a transform method). The last estimator may be any type (transformer, classifier, etc.).
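As a minimal sketch of the sequence mentioned above (feature selection, normalization, classification), the following chains three estimators and fits them with one call. The iris dataset, the step names and the choice of k=2 are assumptions made for illustration only; load_iris(return_X_y=True) assumes scikit-learn 0.18 or later.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative data: 150 samples, 4 features, 3 classes.
X, y = load_iris(return_X_y=True)

# Every step except the last is a transformer; the last is a classifier.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=2)),  # feature selection
    ("scale", StandardScaler()),              # normalization
    ("svm", SVC()),                           # classification
])

# One fit call fits the whole sequence; one predict call runs it end to end.
pipe.fit(X, y)
predictions = pipe.predict(X)
print(predictions.shape)
```

Because the last step is a classifier, the pipeline itself exposes predict and score.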
4.1.1.1. Usage
The Pipeline is built using a list of (key, value) pairs, where the key is a string containing the name you want to give this step and value is an estimator object:
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import SVC
>>> from sklearn.decomposition import PCA
>>> estimators = [('reduce_dim', PCA()), ('svm', SVC())]
>>> clf = Pipeline(estimators)
>>> clf
Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None,
whiten=False)), ('svm', SVC(C=1.0, cache_size=200, class_weight=None,
coef0=0.0, decision_function_shape=None, degree=3, gamma='auto',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])
The utility function make_pipeline is a shorthand for constructing pipelines; it takes a variable number of estimators and returns a pipeline, filling in the names automatically:
>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.naive_bayes import MultinomialNB
>>> from sklearn.preprocessing import Binarizer
>>> make_pipeline(Binarizer(), MultinomialNB())
Pipeline(steps=[('binarizer', Binarizer(copy=True, threshold=0.0)),
('multinomialnb', MultinomialNB(alpha=1.0,
class_prior=None,
fit_prior=True))])
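The automatic names are the lowercased class names of the estimators. A short sketch that reads them back from the steps attribute (the three estimators chosen here are arbitrary):

```python
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), PCA(n_components=2), LogisticRegression())

# Each entry of steps is a (name, estimator) pair; collect only the names.
step_names = [name for name, _ in pipe.steps]
print(step_names)
```

These generated names are the prefixes to use when addressing nested parameters, for example pca__n_components.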
The estimators of a pipeline are stored as a list in the steps attribute:
>>> clf.steps[0]
('reduce_dim', PCA(copy=True, n_components=None, whiten=False))
and as a dict in named_steps:
>>> clf.named_steps['reduce_dim']
PCA(copy=True, n_components=None, whiten=False)
Parameters of the estimators in the pipeline can be accessed using the <estimator>__<parameter> syntax:
>>> clf.set_params(svm__C=10)
Pipeline(steps=[('reduce_dim', PCA(copy=True, n_components=None,
whiten=False)), ('svm', SVC(C=10, cache_size=200, class_weight=None,
coef0=0.0, decision_function_shape=None, degree=3, gamma='auto',
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False))])
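The same double-underscore names work for reading parameters back. A small sketch, reusing the step names reduce_dim and svm from the example above, that sets two nested parameters and checks them through get_params and named_steps:

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

clf = Pipeline([("reduce_dim", PCA()), ("svm", SVC())])

# Several nested parameters can be set in a single call.
clf.set_params(svm__C=10, reduce_dim__n_components=2)

# get_params returns a flat dict keyed by <estimator>__<parameter>.
params = clf.get_params()
print(params["svm__C"], params["reduce_dim__n_components"])

# The change is applied to the underlying estimator objects themselves.
print(clf.named_steps["svm"].C)
```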
This is particularly important for doing grid searches:
>>> from sklearn.grid_search import GridSearchCV
>>> params = dict(reduce_dim__n_components=[2, 5, 10],
... svm__C=[0.1, 10, 100])
>>> grid_search = GridSearchCV(clf, param_grid=params)
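To actually run such a search, the grid search object is fitted like any other estimator. The sketch below is an illustration under assumptions: it uses the iris dataset, a smaller grid that fits its four features, cv=3, and it imports GridSearchCV from sklearn.model_selection, which is where it lives in scikit-learn 0.18 and later.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV  # assumes scikit-learn >= 0.18
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

clf = Pipeline([("reduce_dim", PCA()), ("svm", SVC())])

# Keys follow the <estimator>__<parameter> syntax, so one grid can
# cover parameters of both the PCA step and the SVC step.
params = {
    "reduce_dim__n_components": [2, 3],
    "svm__C": [0.1, 10, 100],
}

grid_search = GridSearchCV(clf, param_grid=params, cv=3)
grid_search.fit(X, y)

best = grid_search.best_params_
print(best)
```

Since the whole pipeline is refitted inside each cross-validation split, the PCA step is fitted on the training part of the split only.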
4.1.1.2. Notes
Calling fit on the pipeline is the same as calling fit on each estimator in turn, transforming the input and passing it on to the next step.
The pipeline has all the methods that the last estimator in the pipeline has, i.e. if the last estimator is a classifier, the Pipeline can be used as a classifier. If the last estimator is a transformer, again, so is the pipeline.
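The equivalence described in these notes can be checked by hand. The following sketch, which assumes the iris dataset and a two-step PCA plus SVC pipeline, fits the pipeline once and then repeats the same steps manually; both routes should give identical predictions because every step here is deterministic.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Route 1: let the pipeline fit and chain the steps.
pipe = Pipeline([("reduce_dim", PCA(n_components=2)), ("svm", SVC())])
pipe.fit(X, y)

# Route 2: fit each estimator in turn, passing the transformed data on.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
svm = SVC().fit(X_reduced, y)

# Prediction also transforms first, then calls the final estimator.
same = np.array_equal(pipe.predict(X), svm.predict(pca.transform(X)))
print(same)
```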
4.1.2. FeatureUnion: composite feature spaces
FeatureUnion combines several transformer objects into a new transformer that combines their output. A FeatureUnion takes a list of transformer objects. During fitting, each of these is fit to the data independently. For transforming data, the transformers are applied in parallel, and the sample vectors they output are concatenated end-to-end into larger vectors.
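The concatenation is easiest to see in the shape of the output. In this sketch (the iris data and the two transformers are assumptions chosen for illustration), one transformer produces two columns and the other one column, so the union produces three:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion

X, y = load_iris(return_X_y=True)

union = FeatureUnion([
    ("pca", PCA(n_components=2)),  # contributes 2 columns
    ("best", SelectKBest(k=1)),    # contributes 1 column
])

# Each transformer is fitted on the same X (and y) independently;
# their outputs are stacked side by side, in the order listed.
X_combined = union.fit_transform(X, y)
print(X_combined.shape)  # (150, 3)
```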
FeatureUnion serves the same purposes as Pipeline: convenience and joint parameter estimation and validation. FeatureUnion and Pipeline can be combined to create complex models.
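One common way to combine the two is to use a FeatureUnion as a step inside a Pipeline. The sketch below is one possible arrangement, with hypothetical step names (features, pca, best, svm) and the iris dataset as assumptions; nested parameters are addressed by chaining the names with double underscores.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The union builds the feature space; the pipeline feeds it to a classifier.
features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("best", SelectKBest(k=1)),
])
model = Pipeline([("features", features), ("svm", SVC())])

# Parameters two levels down: <step>__<transformer>__<parameter>.
model.set_params(features__pca__n_components=3)
model.fit(X, y)

pca_step = model.named_steps["features"].transformer_list[0][1]
print(pca_step.n_components)
```

The same nested names can be used as keys in a parameter grid, so a single search can tune the transformers inside the union together with the final estimator.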
(A FeatureUnion has no way of checking whether two transformers might produce identical features. It simply concatenates their outputs, so the result is a true union only when the feature sets are disjoint, and making sure they are is the caller’s responsibility.)
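A quick way to see this is to put the same transformer in twice. In the following sketch (an illustration on the iris data, with hypothetical names pca_a and pca_b), no error or warning is raised and the output simply contains each feature twice:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion

X, _ = load_iris(return_X_y=True)

# Two identically configured transformers under different names.
union = FeatureUnion([
    ("pca_a", PCA(n_components=2)),
    ("pca_b", PCA(n_components=2)),
])
X_dup = union.fit_transform(X)

# Columns 0-1 and columns 2-3 carry the same information.
duplicated = np.allclose(X_dup[:, :2], X_dup[:, 2:])
print(X_dup.shape, duplicated)
```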
4.1.2.1. Usage
A FeatureUnion is built using a list of (key, value) pairs, where the key is the name you want to give to a given transformation (an arbitrary string; it only serves as an identifier) and value is an estimator object:
>>> from sklearn.pipeline import FeatureUnion
>>> from sklearn.decomposition import PCA
>>> from sklearn.decomposition import KernelPCA
>>> estimators = [('linear_pca', PCA()), ('kernel_pca', KernelPCA())]
>>> combined = FeatureUnion(estimators)
>>> combined
FeatureUnion(n_jobs=1, transformer_list=[('linear_pca', PCA(copy=True,
n_components=None, whiten=False)), ('kernel_pca', KernelPCA(alpha=1.0,
coef0=1, degree=3, eigen_solver='auto', fit_inverse_transform=False,
gamma=None, kernel='linear', kernel_params=None, max_iter=None,
n_components=None, remove_zero_eig=False, tol=0))],
transformer_weights=None)
Like pipelines, feature unions have a shorthand constructor called make_union that does not require explicit naming of the components.
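As a brief sketch of that shorthand, the union from the example above can be built without naming its parts; the names are then derived from the lowercased class names:

```python
from sklearn.decomposition import PCA, KernelPCA
from sklearn.pipeline import make_union

combined = make_union(PCA(), KernelPCA())

# transformer_list holds (name, transformer) pairs, like Pipeline.steps.
names = [name for name, _ in combined.transformer_list]
print(names)
```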