5.5.3. Downloading datasets from the mldata.org repository¶
mldata.org is a public repository for machine learning data, supported by the PASCAL network .
The sklearn.datasets
package is able to directly download data
sets from the repository using the function
sklearn.datasets.fetch_mldata
.
For example, to download the MNIST digit recognition database:
>>> from sklearn.datasets import fetch_mldata
>>> mnist = fetch_mldata('MNIST original', data_home=custom_data_home)
The MNIST database contains a total of 70000 examples of handwritten digits of size 28x28 pixels, labeled from 0 to 9:
>>> mnist.data.shape
(70000, 784)
>>> mnist.target.shape
(70000,)
>>> np.unique(mnist.target)
array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
After the first download, the dataset is cached locally in the path
specified by the data_home
keyword argument, which defaults to
~/scikit_learn_data/
:
>>> os.listdir(os.path.join(custom_data_home, 'mldata'))
['mnist-original.mat']
Data sets in mldata.org do not adhere to a strict
naming or formatting convention. sklearn.datasets.fetch_mldata
is
able to make sense of the most common cases, but allows to tailor the
defaults to individual datasets:
The data arrays in mldata.org are most often shaped as
(n_features, n_samples)
. This is the opposite of thescikit-learn
convention, sosklearn.datasets.fetch_mldata
transposes the matrix by default. Thetranspose_data
keyword controls this behavior:>>> iris = fetch_mldata('iris', data_home=custom_data_home) >>> iris.data.shape (150, 4) >>> iris = fetch_mldata('iris', transpose_data=False, ... data_home=custom_data_home) >>> iris.data.shape (4, 150)
For datasets with multiple columns,
sklearn.datasets.fetch_mldata
tries to identify the target and data columns and rename them totarget
anddata
. This is done by looking for arrays namedlabel
anddata
in the dataset, and failing that by choosing the first array to betarget
and the second to bedata
. This behavior can be changed with thetarget_name
anddata_name
keywords, setting them to a specific name or index number (the name and order of the columns in the datasets can be found at its mldata.org under the tab “Data”:>>> iris2 = fetch_mldata('datasets-UCI iris', target_name=1, data_name=0, ... data_home=custom_data_home) >>> iris3 = fetch_mldata('datasets-UCI iris', target_name='class', ... data_name='double0', data_home=custom_data_home)