CNNs for the Fashion MNIST

Can machines recognize tops from trousers?

This is part two of a two-part series about the Fashion MNIST dataset. I’m writing about a convolutional neural network here. In part one, I built a basic neural network, which you can read about here.

The Fashion MNIST dataset is a collection of 70,000 greyscale images of articles of clothing, each 28 by 28 pixels. The challenge is to train a computer to recognize different items of clothing from these images.

The first 20 photos and labels of the Fashion MNIST testing set.

We’ve already built a dense neural network that reached an average accuracy of about 85%. If we want to get an accuracy above 90% though, it’s better to use a convolutional neural network (CNN).

Why use a convolutional neural network?

Imagine I ask you to identify the object in the photo below.

It’s clearly a tree. Photo by Johann Siemens from Unsplash.

It’s clearly a tree, you say.

How would you justify that?

Well, it has a trunk? And it has branches that branch out into smaller branches, and then into lots of green leaves.

Perfect. You just described the characteristics of a tree. The human brain works so fast that we don’t even realize how we reach conclusions. But when I ask for justification, you start naming the patterns you’ve learned to recognize as making up a tree.

CNNs for feature recognition

Machines can learn the same way. We can teach them to recognize the specific features that make a shoe a shoe.

While a standard neural network can classify images reasonably well, convolutional neural networks are designed specifically for images, so they are better suited to recognizing the features in them.

Let’s look at photo 9 as an example. We want to teach the model to recognize a sandal.

In our regular dense neural network, we flattened the inputs into a vector, then combed through that to find similarities. In practice, that means the model learns to recognize certain features in a particular place. It might look for the short vamp on the left side, a diagonal sole in the middle, and then a vertical block on the right side.

The model sees the vamp on the left side, the sole in a diagonal, and the heel on the right side.

While this works to a certain degree, the model might get stuck looking for features in this order. Reducing data to a simple vector can make computations more efficient, but we also sacrifice some of the spatial detail of a photo. The model might only look for the heel on the right side and never on the left. If we were to flip the photo, it might incorrectly conclude that there is no heel.

The model looks in the same places for the same features, but it won’t find what it’s looking for.
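Here's a toy sketch of that failure in NumPy. Everything in it is made up for illustration: the 6 by 6 "photo", the 2 by 2 bright block standing in for the heel, and the hand-set weights of a single dense unit (a real network would learn its weights). It only shows why a unit tied to fixed positions in the flattened vector goes quiet when the photo is mirrored.

```python
import numpy as np

# Toy 6x6 "photo" with a 2x2 bright block standing in for the heel.
photo = np.zeros((6, 6))
photo[4:6, 4:6] = 1.0          # heel in the bottom-right corner
flipped = np.fliplr(photo)     # mirrored photo: heel in the bottom-left corner

# A hypothetical dense unit whose weights respond only to bright pixels
# in the bottom-right corner. The weights are hand-set here, not learned.
weights = photo.ravel().copy()

# The unit's raw score is a dot product with the flattened photo.
score_original = float(weights @ photo.ravel())
score_flipped = float(weights @ flipped.ravel())

print(score_original)  # 4.0: every heel pixel lines up with a weight
print(score_flipped)   # 0.0: the heel moved, so no weight sees it
```

The heel is still in the flipped photo, but the unit's weights point at the wrong positions in the vector, so its score drops to zero.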

CNNs can correct for this oversight. Instead of looking at the photo as a vectorized whole, they notice the multi-dimensional features. They create feature maps to predict item classifications.

By taking into account the spatial relations between the pixels, they can identify repeated features regardless of the order; consequently, the orientation of a photo won’t play as big a role.

The model recognizes the vamp, the sole, and the heel, despite the mixed up order.

How does the CNN work?

Here’s a quick refresher of CNN terminology:

  • Filters are small frames through which the model identifies the features. In the case above, the filter is a 7x7 square.
  • Strides determine how far the filter moves at each step. A larger stride skips more pixels and produces a smaller feature map, so it may miss fine details; a smaller stride captures more detail but costs more computation.
  • Pooling reduces the size of a convolutional layer's output for more efficient processing. The output is usually the biggest value in each pooling window (max pooling).
  • Padding adds extra pixels around the photo border so that the filter covers the entire image, including the edges. This is useful when a photo cuts off part of a feature, but it’s less relevant in this example as the Fashion MNIST dataset does not cut off any photos.

This convolution operation has a filter of 3 by 3 with stride length 2 and uses “same” padding. Source.
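To make stride and padding a bit more concrete, here's a small sketch of how they set the width of the feature map. The function name conv_output_size is my own. The two formulas are the standard ones for "same" and "valid" (no) padding, and the 28-pixel input is simply the width of a Fashion MNIST image.

```python
import math

def conv_output_size(input_size, filter_size, stride, padding):
    """Width (or height) of the output along one dimension."""
    if padding == "same":
        # The border is padded so that only the stride shrinks the output.
        return math.ceil(input_size / stride)
    if padding == "valid":
        # No padding: the filter must fit entirely inside the image.
        return (input_size - filter_size) // stride + 1
    raise ValueError("padding must be 'same' or 'valid'")

print(conv_output_size(28, 3, 1, "same"))   # 28
print(conv_output_size(28, 3, 2, "same"))   # 14
print(conv_output_size(28, 3, 1, "valid"))  # 26
```

With "same" padding and a stride of 1, the feature map keeps the size of the input; a stride of 2 roughly halves it.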

The basic mechanism of a CNN is as follows:

  1. We input the image data as numerical values representing each pixel's colour intensity.
  2. The model performs a convolution operation, calculating the dot product of the filter's weights and the pixel values the filter currently covers.
  3. Repeat step 2 each time the model takes a stride and passes the filter over a new area. This creates the feature map.
  4. Using pooling, the model reduces the size of the feature map.
  5. Repeat steps 2–4 at your discretion.
  6. We flatten the result and put it through a fully-connected layer for final classification. This is basically the dense neural network we built in part one.

For a more comprehensive view of CNNs, you can also read this article.
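Steps 1 to 3 can be sketched by hand in NumPy. This is a simplified illustration, not what Keras does internally: the function convolve2d is my own naive loop, there is no padding, and the vertical-edge filter is hand-picked, whereas a real CNN learns its filter weights during training.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` without padding and return the feature map."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride          # step 3: take a stride
            window = image[r:r + kh, c:c + kw]
            feature_map[i, j] = np.sum(window * kernel)  # step 2: dot product
    return feature_map

# Step 1: a tiny greyscale image, dark on the left and bright on the right.
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

# A hand-picked filter that responds to dark-to-bright vertical edges.
kernel = np.array([
    [-1, 1],
    [-1, 1],
], dtype=float)

feature_map = convolve2d(image, kernel)
print(feature_map)
# [[0. 2. 0.]
#  [0. 2. 0.]
#  [0. 2. 0.]]
```

The feature map lights up only in the middle column, exactly where the edge sits in the image.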

Building the model

Since we already built the framework for a neural network, we just need to change the model to that of a CNN. We can skip to the section where we define the model.

We’re still using a sequential model for the CNN, meaning it will take information from one layer to the next, and on and on, without any returning inputs. This time, we’re also introducing the Keras Conv2D class. Within it, we specify parameters to create a feature map:

  • The number of filters: 32. Each filter produces its own feature map, so this layer outputs 32 of them.
  • The size of the filter: 3 by 3. The network will identify features that fit within the filter.
  • The padding: same. There will be padding on the border to take into account all the pixels.
  • The activation function: ReLU. This activation function is cheap to compute: it passes positive values through unchanged and sets negative values to zero.
  • The input shape: 28, 28, 1. The image contains 28 by 28 values, but we’re only considering one colour channel, greyscale. If the images were coloured, it might look like 28, 28, 3. We only need this for the first layer in the network.

import tensorflow as tf
from tensorflow.keras import layers, models
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), padding='same', activation='relu', input_shape=(28, 28, 1)))

We perform a pooling operation on that layer to reduce its dimensions further while conserving features. Within a 2 by 2 filter on the feature map, the model will keep the largest value. It will move the filters by 2 each time.

model.add(layers.MaxPooling2D((2,2), strides = 2))
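Here's roughly what a 2 by 2 max pooling step with a stride of 2 does to the numbers. The function max_pool2d is my own plain NumPy sketch, and the 4 by 4 feature map is made up for illustration.

```python
import numpy as np

def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value inside each `size` x `size` window."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    pooled = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            pooled[i, j] = feature_map[r:r + size, c:c + size].max()
    return pooled

# A made-up 4x4 feature map.
feature_map = np.array([
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [0, 2, 9, 8],
    [1, 5, 7, 3],
], dtype=float)

pooled = max_pool2d(feature_map)
print(pooled)
# [[6. 2.]
#  [5. 9.]]
```

Each side of the map is halved, yet the strongest response in each region survives.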

We can repeat with another convolutional layer and pooling operation, this time with 64 filters. The number of filters generally increases as we get closer to the output.

model.add(layers.Conv2D(64, (3,3), padding='same', activation = 'relu'))
model.add(layers.MaxPooling2D((2,2), strides = 2))

The number of layers depends on the complexity of the problem. The Fashion MNIST dataset isn’t very challenging, so three layers should be enough: two convolutional layers and a dense classifier. For the last part, we flatten the feature maps into a vector and return to a regular dense neural network; it’s basically the same one we used in part one.

model.add(layers.Flatten())  # turn the 7x7x64 feature maps into one vector
model.add(layers.Dense(128, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))
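To see how the data shrinks on its way through this model, here's a back-of-the-envelope sketch in plain Python. It assumes a flatten step sits between the last pooling layer and the first dense layer, and the helper names conv_params and dense_params are my own. Each count is the number of weights plus one bias per filter or unit.

```python
def conv_params(filter_h, filter_w, in_channels, filters):
    """Weights in every filter plus one bias per filter."""
    return filter_h * filter_w * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units

side, channels = 28, 1           # input: 28 x 28 greyscale
total = 0

total += conv_params(3, 3, channels, 32)
channels = 32                    # 'same' padding keeps 28 x 28
side //= 2                       # 2x2 pooling, stride 2 -> 14 x 14

total += conv_params(3, 3, channels, 64)
channels = 64                    # still 14 x 14 after the convolution
side //= 2                       # pooling again -> 7 x 7

flat = side * side * channels    # assumed flatten step: 7 * 7 * 64 values
total += dense_params(flat, 128)
total += dense_params(128, 10)

print(flat)    # 3136
print(total)   # 421642
```

Most of the parameters live in the first dense layer, not in the convolutional layers.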

Since we specified our input_shape to be (28, 28, 1), we need to reshape our data to match it. Without changing the values, we can just add another dimension to represent the number of colour channels: 1, for greyscale. We'll change it back to (28, 28) when we show the image at the end.

train_X = tf.reshape(train_X, [60000, 28, 28, 1])
test_X = tf.reshape(test_X, [10000, 28, 28, 1])
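If you'd rather not hard-code the number of images, a -1 in the shape asks the library to work out that dimension for you. Here's a small NumPy sketch with a stand-in batch of five blank images (not the real dataset); as far as I know, tf.reshape accepts -1 in the same way.

```python
import numpy as np

# Stand-in for a batch of five 28x28 greyscale images (all zeros).
batch = np.zeros((5, 28, 28), dtype=np.float32)

# Add the colour-channel dimension; -1 infers the number of images.
with_channel = batch.reshape(-1, 28, 28, 1)

# Drop the channel again before displaying an image.
single_image = with_channel[0].squeeze(axis=-1)

print(with_channel.shape)   # (5, 28, 28, 1)
print(single_image.shape)   # (28, 28)
```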

That's it! You can see the full code here. If you run through it, you can test the accuracy yourself. Choose a random photo by typing in its number, then check if the model classified it correctly.

This code is based on my original code and the Intro to TensorFlow for Deep Learning course. Definitely try it out as a beginner machine learning project.

If you haven’t already, read the first part here to see how a regular neural network would work.

If you liked this article, follow me on Medium and LinkedIn to see more in the future! Reach out if you have any questions. I’d be happy to get in touch!

Learning how tomorrow's technologies will transform today's future. Especially interested in artificial intelligence and climate solutions.