Data¶
The data module provides APIs to 1) load and parse datasets, 2)
transform data examples, and 3) sample mini-batches for the training
program. This tutorial will go through these three functionalities.
Let's first import the needed modules; the key one is
mxnet.gluon.data, which we import as gdata to avoid clashing with the
overly common name data.
In [1]:
import numpy as np
from matplotlib import pyplot as plt
import tarfile
import time
from mxnet import nd, image, io
from mxnet.gluon import utils, data as gdata
Load and Parse Datasets¶
To use a dataset, we first need to parse it into individual examples
with the NDArray
data type. The base class is Dataset
, which
defines two essential methods: __getitem__
to access the i-th
example and __len__
to get the number of examples.
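To make this concrete, here is a minimal sketch of a custom Dataset; the class SquaresDataset and its contents are hypothetical, but any class implementing these two methods works with the rest of the data pipeline:
class SquaresDataset(gdata.Dataset):
    """A toy dataset whose i-th example is the pair (i, i*i)."""
    def __init__(self, length):
        self.length = length
    def __getitem__(self, idx):
        return idx, idx * idx
    def __len__(self):
        return self.length

squares = SquaresDataset(5)
len(squares), squares[3]  # evaluates to (5, (3, 9))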
ArrayDataset is an implementation of Dataset that combines multiple
array-like objects into a dataset. In the following example, we
define the NDArray features and labels, and then create the dataset.
In [2]:
features = nd.random.uniform(shape=(10, 3))
labels = nd.arange(10)
dataset = gdata.ArrayDataset(features, labels)
We can query the number of examples in this dataset:
In [3]:
len(dataset)
Out[3]:
10
And access an arbitrary example with its index. The returned example is a list containing its features and label.
In [4]:
sample = dataset[1]
'features:', sample[0], 'label:', sample[1]
Out[4]:
('features:',
[0.84426576 0.60276335 0.8579456 ]
<NDArray 3 @cpu(0)>, 'label:', 1.0)
Note that the label for each example is a scalar; it is automatically
converted into a numpy scalar so that it is easy to use, e.g. as an
index, without calling sample[1].asscalar().
In [5]:
type(sample[1])
Out[5]:
numpy.float32
In addition, ArrayDataset can construct a dataset from any
array-like objects, with an arbitrary number of arrays:
In [6]:
dataset2 = gdata.ArrayDataset(features, np.random.uniform(size=(10,1)), list(range(0,10)))
sample = dataset2[1]
type(sample[0]), type(sample[1]), type(sample[2])
Out[6]:
(mxnet.ndarray.ndarray.NDArray, numpy.ndarray, int)
Predefined Datasets¶
This module provides several commonly used datasets that are automatically downloaded during creation. For example, we can obtain both the training and validation sets of MNIST:
In [7]:
mnist_train = gdata.vision.MNIST()
mnist_valid = gdata.vision.MNIST(train=False)
print('# of training examples =', len(mnist_train))
print('# of validation examples =', len(mnist_valid))
# of training examples = 60000
# of validation examples = 10000
Obtaining one example works the same as before:
In [8]:
sample = mnist_train[1]
print('X shape:', sample[0].shape)
print('y:', sample[1])
X shape: (28, 28, 1)
y: 0
Besides MNIST, mxnet.gluon.data.vision
provides these three
datasets: FashionMNIST, CIFAR10, and CIFAR100.
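They are constructed in the same way; for example, the following sketch obtains FashionMNIST and CIFAR10 (like MNIST, the data is downloaded on first use):
fashion_train = gdata.vision.FashionMNIST()
cifar_train = gdata.vision.CIFAR10()
print('# of FashionMNIST examples =', len(fashion_train))
print('# of CIFAR10 examples =', len(cifar_train))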
Load Individual Images¶
In vision tasks, the examples are often stored as individual image
files. If the images of each category are placed in a separate
folder, we can use ImageFolderDataset to load both images and labels.
Let's download a tiny image dataset as an example.
In [9]:
utils.download('https://github.com/dmlc/web-data/raw/master/mxnet/doc/dogcat.tar.gz')
with tarfile.open('dogcat.tar.gz') as f:
    f.extractall()
Downloading dogcat.tar.gz from https://github.com/dmlc/web-data/raw/master/mxnet/doc/dogcat.tar.gz...
Then check the contents of this dataset:
In [10]:
# You may need to install the `tree` program, e.g. by uncommenting the following line on Ubuntu:
# !sudo apt-get install tree
!tree dogcat
dogcat
├── cat
│   ├── cat1.jpg
│   ├── cat2.jpg
│   ├── cat3.png
│   └── cat4.jpeg
└── dog
    ├── dog1.jpg
    ├── dog2.jpg
    ├── dog3.jpg
    └── dog4.png

2 directories, 8 files
As can be seen, it has two categories, dog and cat, and the image
files are placed in subfolders named after their categories. Now
construct an ImageFolderDataset instance by specifying the dataset
root folder.
In [11]:
dogcat = gdata.vision.ImageFolderDataset('./dogcat')
We can access all categories through the attribute synsets
:
In [12]:
dogcat.synsets
Out[12]:
['cat', 'dog']
Next let’s print a particular sample with its label:
In [13]:
sample = dogcat[1]
plt.imshow(sample[0].asnumpy())
plt.show()
'label:', dogcat.synsets[sample[1]]

Out[13]:
('label:', 'cat')
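To see every file's decoded label at once, we can iterate over the whole dataset; this small sketch loads each of the 8 images:
for i in range(len(dogcat)):
    img, label = dogcat[i]
    print('example %d: shape %s, label %s' % (i, img.shape, dogcat.synsets[label]))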
Transform Data Examples¶
The raw data examples often need to be transformed before being fed
into a neural network. Class Dataset provides two methods, transform
and transform_first, that let users specify the transformations.
In the following example, we define a function that resizes an image
to 200px height and 300px width, and pass it to the dataset through
the transform_first method, which returns a new dataset with the
transformation recorded.
In [14]:
def resize(x):
    y = image.imresize(x, 300, 200)
    print('resize', x.shape, 'into', y.shape)
    return y
dogcat_resized = dogcat.transform_first(resize)
Transformations are applied lazily, when examples are accessed; this is necessary when the transformations contain randomness. But we can also apply all transformations once, when creating the dataset:
In [15]:
dogcat_cached = dogcat.transform_first(resize, lazy=False)
resize (400, 600, 3) into (200, 300, 3)
resize (428, 590, 3) into (200, 300, 3)
resize (519, 510, 3) into (200, 300, 3)
resize (380, 645, 3) into (200, 300, 3)
resize (452, 400, 3) into (200, 300, 3)
resize (486, 729, 3) into (200, 300, 3)
resize (267, 400, 3) into (200, 300, 3)
resize (486, 729, 3) into (200, 300, 3)
So no transformation is needed when accessing examples.
In [16]:
dogcat_cached[0][0].shape
Out[16]:
(200, 300, 3)
Besides transform_first, we can use transform to apply a function to
all entries in an example. The following example adds 10 to the label
besides resizing the image:
In [17]:
dogcat_both = dogcat.transform(lambda x, y: (resize(x), y+10))
dogcat_both[0][1]
resize (400, 600, 3) into (200, 300, 3)
Out[17]:
10
The vision.transforms submodule provides multiple predefined data
transformation methods. For example, the following example chains
Resize and ToTensor, which changes the data layout to (C
x H x W) with float32 data type. Please refer to Image
Augmentation for more details.
In [18]:
transforms = gdata.vision.transforms.Compose([
    gdata.vision.transforms.Resize((24, 24)),
    gdata.vision.transforms.ToTensor()])
mnist_transformed = mnist_train.transform_first(transforms, lazy=True)
(mnist_train[0][0].shape, '->', mnist_transformed[0][0].shape)
Out[18]:
((28, 28, 1), '->', (1, 24, 24))
Sample Mini-batches¶
If we train a neural network with mini-batch SGD, we need to sample a
mini-batch for every iteration. Class DataLoader splits a dataset
into mini-batches. In the following example, we create a DataLoader
instance, which is an iterator that returns a mini-batch each time.
In [19]:
data = gdata.DataLoader(dataset, batch_size=4)
for X, y in data:
    print('X shape:', X.shape, '\ty:', y.asnumpy())
X shape: (4, 3) y: [0. 1. 2. 3.]
X shape: (4, 3) y: [4. 5. 6. 7.]
X shape: (2, 3) y: [8. 9.]
Since the number of examples cannot be evenly divided by the batch size, the last mini-batch only has two examples. We can choose to discard the last incomplete mini-batch:
In [20]:
data = gdata.DataLoader(dataset, batch_size=4, last_batch='discard')
for X, y in data:
    print('y:', y.asnumpy())
y: [0. 1. 2. 3.]
y: [4. 5. 6. 7.]
Or roll it over to the beginning of the next epoch:
In [21]:
data = gdata.DataLoader(dataset, batch_size=4, last_batch='rollover')
for X, y in data:
    print('epoch 0, y:', y.asnumpy())
for X, y in data:
    print('epoch 1, y:', y.asnumpy())
epoch 0, y: [0. 1. 2. 3.]
epoch 0, y: [4. 5. 6. 7.]
epoch 1, y: [8. 9. 0. 1.]
epoch 1, y: [2. 3. 4. 5.]
epoch 1, y: [6. 7. 8. 9.]
In mini-batch SGD, a mini-batch needs to consist of randomly sampled
examples. We can set the shuffle
argument to get random batches:
In [22]:
data = gdata.DataLoader(dataset, batch_size=4, shuffle=True)
for X, y in data:
    print('y:', y.asnumpy())
y: [5. 8. 6. 4.]
y: [0. 3. 9. 7.]
y: [2. 1.]
Customize Sampling¶
DataLoader reads examples either sequentially or uniformly at random
without replacement. We can change this behavior through a customized
sampler. A sampler is an iterator that returns a sample index each
time. For example, we create a sampler that first sequentially reads
the even indices and then the odd indices.
In [23]:
class MySampler():
    def __init__(self, length):
        self.len = length
    def __iter__(self):
        for i in list(range(0, self.len, 2)) + list(range(1, self.len, 2)):
            yield i
data = gdata.DataLoader(dataset, batch_size=4, sampler=MySampler(len(dataset)))
for X, y in data:
print(y.asnumpy())
[0. 2. 4. 6.]
[8. 1. 3. 5.]
[7. 9.]
Similarly, we can change how mini-batches are sampled through the
batch_sampler argument.
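For instance, the built-in BatchSampler wraps a sampler and yields one list of indices per mini-batch. Here is a sketch that groups the indices from MySampler above into batches of 4 (note that batch_size must not also be passed to DataLoader when batch_sampler is given):
batch_sampler = gdata.BatchSampler(MySampler(len(dataset)), batch_size=4, last_batch='keep')
data = gdata.DataLoader(dataset, batch_sampler=batch_sampler)
for X, y in data:
    print(y.asnumpy())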
Multi-process¶
Reading data is often one of the major performance bottlenecks. We can accelerate it with multiple worker processes (only Linux and macOS are supported). Let's first benchmark the time to read the MNIST training set:
In [24]:
tic = time.time()
data = gdata.DataLoader(mnist_transformed, batch_size=64)
for X, y in data:
    pass
'%.1f sec' % (time.time() - tic)
Out[24]:
'9.8 sec'
Now let’s use 4 processes:
In [25]:
tic = time.time()
data = gdata.DataLoader(mnist_transformed, batch_size=64, num_workers=4)
for X, y in data:
    pass
'%.1f sec' % (time.time() - tic)
Out[25]:
'2.1 sec'
Appendix: From DataIter to DataLoader¶
Before Gluon's DataLoader, MXNet provided DataIter in the io
module to read mini-batches. The two are similar, but
DataLoader returns a tuple of (feature, label) for each
mini-batch, while DataIter returns a DataBatch. The following
example wraps a DataIter into a DataLoader so you can reuse
existing code while enjoying the benefits of Gluon.
In [26]:
class DataIterLoader():
    def __init__(self, data_iter):
        self.data_iter = data_iter
    def __iter__(self):
        self.data_iter.reset()
        return self
    def __next__(self):
        batch = self.data_iter.__next__()
        assert len(batch.data) == len(batch.label) == 1
        data = batch.data[0]
        label = batch.label[0]
        return data, label
    def next(self):
        return self.__next__()  # for Python 2
Now create a DataIter instance, and then get the corresponding
DataLoader wrapper.
In [27]:
data_iter = io.NDArrayIter(data=features, label=labels, batch_size=4)
data = DataIterLoader(data_iter)
for X, y in data:
    print('X shape:', X.shape, '\ty:', y.asnumpy())
X shape: (4, 3) y: [0. 1. 2. 3.]
X shape: (4, 3) y: [4. 5. 6. 7.]
X shape: (4, 3) y: [8. 9. 0. 1.]