Understanding Convolutional Layers in Convolutional Neural Networks (CNNs)
--------------------------------------------------------------------------
A comprehensive tutorial on 2D convolutional layers

Introduction
------------
2D convolutional layers, together with pooling and fully-connected
layers, form the building blocks of Convolutional Neural Networks
(CNNs) and the basis of deep learning. So if you want to go deeper
into CNNs and deep learning, the first step is to get more familiar
with how convolutional layers work. If you are not familiar with
applying 2D filters to images, we strongly suggest you first have a
look at our previous post about image filtering
[here](image_convolution_1.html). In the [image
filtering](image_convolution_1.html) post, we talked about convolving
a filter with an image. In that post, we had a 2D filter kernel (a 2D
matrix) and a single-channel (grayscale) image. To calculate the
convolution, we swept the kernel (remember that we should flip the
kernel first and then do the convolution; for the rest of this post we
assume the kernel is already flipped) over the image and calculated
the output at every single location. In fact, the **stride** of our
convolution was 1. What is a stride? The **stride** is the number of
pixels by which we slide our filter, horizontally or vertically. In
other words, in that case we moved our filter one pixel at each step
to calculate the next convolution output. For a convolution with
stride 2, however, we calculate the output for every other pixel
(i.e., we jump 2 pixels), and consequently the output of the
convolution is roughly half the size of the input image. Figure 1
compares two 2D convolutions with strides one and two, respectively.


Note that you can have different strides horizontally and vertically.
You can use the following equations to calculate the exact size of the
convolution output for an input with the size of (width = <img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/84c95f91a742c9ceb460a83f9b5090bf.svg?invert_in_darkmode" align=middle width=18.13053pt height=21.69783pt/>, height
= <img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/7b9a0316a2fcd7f01cfd556eedf72e96.svg?invert_in_darkmode" align=middle width=15.32223pt height=21.69783pt/>) and a Filter with the size of (width = <img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/a6f127b4375ffe34a939afe6f6d88a07.svg?invert_in_darkmode" align=middle width=20.71245pt height=21.69783pt/>, height =
<img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/1ab079395027c7fee523b46ac98a4e9a.svg?invert_in_darkmode" align=middle width=18.589065pt height=21.69783pt/>):
<p align="center"><img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/1007d29ec0d1c66f578ea22be4689b30.svg?invert_in_darkmode" align=middle width=249.40905pt height=36.09507pt/></p>
<p align="center"><img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/4bc21c1e2ef5ec53ed371edf9a9fab1f.svg?invert_in_darkmode" align=middle width=247.67325pt height=36.09507pt/></p>
where <img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/None/svgs/9b9b57cc812f2598082fbae95c2eb73d.svg?invert_in_darkmode" align=middle width=17.84706pt height=13.38744pt/> and <img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/None/svgs/830b6bb7e31ccb49ce56184a3eca880b.svg?invert_in_darkmode" align=middle width=15.72384pt height=13.38744pt/> are the horizontal and vertical strides of the
convolution, respectively, and <img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/None/svgs/df5a289587a2f0247a5b97c1e8ac58ca.svg?invert_in_darkmode" align=middle width=13.15908pt height=21.69783pt/> is the amount of zero padding added
to the border of the image (look at the [previous post]() if you are not
familiar with the zero-padding concept). However, the output width or
height calculated from these equations might be a non-integer value. In
that case, you might want to handle the situation in a way that satisfies
the desired output dimensions. Here, we explain how **TensorFlow**
approaches the issue. In general, you have two main options for the
padding scheme, which determines the output size, namely the **'SAME'**
and **'VALID'** padding schemes. In the 'SAME' padding scheme, in which
we use zero padding, the size of the output will be
<p align="center"><img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/7281ba59279296e2b5148cc2b893b6d9.svg?invert_in_darkmode" align=middle width=426.8385pt height=36.09507pt/></p>
If the number of pixels of padding required to reach the desired output
size is even, we can simply add half of it to each side of the input
(left and right, or top and bottom). However, if it is odd, we need an
unequal number of zeros on the left and right sides of the input (for
horizontal padding) or on the top and bottom sides of the input (for
vertical padding). Here is how TensorFlow calculates the required
padding on each side:
<p align="center"><img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/8306be941129dcf27d8b09ef447e398d.svg?invert_in_darkmode" align=middle width=521.62605pt height=16.438356pt/></p>
<p align="center"><img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/d60118b3946920a43247967f9eaf3b2f.svg?invert_in_darkmode" align=middle width=522.58635pt height=16.438356pt/></p>
<p align="center"><img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/b5cce0ee4b88deabb525e7ff72f1d868.svg?invert_in_darkmode" align=middle width=515.9352pt height=33.629475pt/></p>
<p align="center"><img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/8195c3020edae6015bd8f1d340aa8a67.svg?invert_in_darkmode" align=middle width=433.98795pt height=14.611872pt/></p>
Similarly, in the 'VALID' padding scheme, in which we do not add any
zero padding to the input, the size of the output would be
<p align="center"><img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/4c011d601ffc424c9759e62ee49cdd29.svg?invert_in_darkmode" align=middle width=557.4327pt height=36.09507pt/></p>
Let's get back to the convolutional layer. A convolutional layer does
exactly the same thing: it applies a filter to an input in a
convolutional manner. Like fully-connected layers, a convolutional
layer has weights, which form its kernel (filter), and a bias. But in
contrast to fully-connected layers, in convolutional layers each pixel
(or neuron) of the output is connected to the input pixels (neurons)
locally instead of being connected to all of them. Hence, we use the
term **receptive field** for the size of a convolutional layer's
filter. The bias in a convolutional layer is a single scalar value
which is added to the output of the filter at every single pixel. What
we have talked about so far was, in fact, a convolutional layer with 1
input and 1 output **channel** (also known as **depth**) and a zero
bias. In general, a convolutional layer can have multiple input
channels (each a 2D matrix) and multiple output channels (again, each
a 2D matrix). Perhaps the most tangible example of a multi-channel
input is a color image, which has 3 RGB channels. Let's feed it to a
convolutional layer with 3 input channels and 1 output channel. How is
it going to calculate the output? The short answer is that the layer
has 3 filter channels (one for each input channel) instead of one. It
calculates the convolution of each filter channel with its
corresponding input channel (the first filter channel with the first
input channel, the second with the second, and so on). The stride is
the same for all channels, so they output matrices of the same size.
The layer then sums up all of these matrices and outputs a single
matrix, which is the only channel at the output of the convolutional
layer. For a better understanding, have a look at Figure 2.

Let's modify our convolution code from the previous post and make a 2D
convolutional layer:
```python
import matplotlib.pyplot as plt
from scipy import misc
import numpy as np
from skimage import exposure
from math import ceil


def convolution2d(conv_input, conv_kernel, bias=0, strides=(1, 1), padding='same'):
    # This function takes an input (tensor) and a kernel (tensor)
    # and returns their convolution.
    # Args:
    #     conv_input: a numpy array of size [input_height, input_width, input # of channels].
    #     conv_kernel: a numpy array of size [kernel_height, kernel_width, input # of channels],
    #         represents the kernel of the convolutional layer's filter.
    #     bias: a scalar value, represents the bias of the convolutional layer's filter.
    #     strides: a tuple of (convolution vertical stride, convolution horizontal stride).
    #     padding: type of the padding scheme: 'same' or 'valid'.
    # Returns:
    #     a numpy array (convolution output).
    assert len(conv_kernel.shape) == 3, "The size of the kernel should be (kernel_height, kernel_width, input # of channels)"
    assert len(conv_input.shape) == 3, "The size of the input should be (input_height, input_width, input # of channels)"
    assert conv_kernel.shape[2] == conv_input.shape[2], "The input and the kernel should have the same depth."
    input_w, input_h = conv_input.shape[1], conv_input.shape[0]      # input width and height
    kernel_w, kernel_h = conv_kernel.shape[1], conv_kernel.shape[0]  # kernel width and height

    if padding == 'same':
        output_height = int(ceil(float(input_h) / float(strides[0])))
        output_width = int(ceil(float(input_w) / float(strides[1])))
        # Calculate the number of zeros which need to be added as padding
        pad_along_height = max((output_height - 1) * strides[0] + kernel_h - input_h, 0)
        pad_along_width = max((output_width - 1) * strides[1] + kernel_w - input_w, 0)
        pad_top = pad_along_height // 2          # amount of zero padding on the top
        pad_bottom = pad_along_height - pad_top  # amount of zero padding on the bottom
        pad_left = pad_along_width // 2          # amount of zero padding on the left
        pad_right = pad_along_width - pad_left   # amount of zero padding on the right
        output = np.zeros((output_height, output_width))  # convolution output
        # Add zero padding to the input image (explicit slice bounds avoid the
        # empty-slice bug of negative indexing when a pad amount is 0)
        image_padded = np.zeros((input_h + pad_along_height,
                                 input_w + pad_along_width, conv_input.shape[2]))
        image_padded[pad_top:pad_top + input_h, pad_left:pad_left + input_w, :] = conv_input
        for x in range(output_width):      # Loop over every pixel of the output
            for y in range(output_height):
                # element-wise multiplication of the kernel and the image patch
                output[y, x] = (conv_kernel * image_padded[y * strides[0]:y * strides[0] + kernel_h,
                                                           x * strides[1]:x * strides[1] + kernel_w, :]).sum() + bias
    elif padding == 'valid':
        output_height = int(ceil(float(input_h - kernel_h + 1) / float(strides[0])))
        output_width = int(ceil(float(input_w - kernel_w + 1) / float(strides[1])))
        output = np.zeros((output_height, output_width))  # convolution output
        for x in range(output_width):      # Loop over every pixel of the output
            for y in range(output_height):
                # element-wise multiplication of the kernel and the image patch
                output[y, x] = (conv_kernel * conv_input[y * strides[0]:y * strides[0] + kernel_h,
                                                         x * strides[1]:x * strides[1] + kernel_w, :]).sum() + bias
    return output


# Load the image as RGB (3 channels); on newer SciPy versions scipy.misc.imread
# has been removed, so you may need imageio.imread instead
img = misc.imread('image.png', mode='RGB')
# The edge detection kernel, repeated over the 3 input channels
kernel = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])[..., None]
kernel = np.repeat(kernel, 3, axis=2)
# Convolve the image and the kernel
image_edges = convolution2d(img, kernel)
# Plot the filtered image
plt.imshow(image_edges, cmap=plt.cm.gray)
plt.axis('off')
plt.show()
# Adjust the contrast of the filtered image by applying histogram equalization
image_edges_equalized = exposure.equalize_adapthist(image_edges / np.max(np.abs(image_edges)),
                                                    clip_limit=0.03)
plt.imshow(image_edges_equalized, cmap=plt.cm.gray)
plt.axis('off')
plt.show()
```
What about when the convolutional layer has more than one output
channel? In that case, the layer has a different multi-channel filter
(with as many channels as there are input channels) for each output
channel. For example, assume we have a layer with three input channels
(RGB) and five output channels. This layer would have 5 filters, each
with 3 channels. It uses each 3-channel filter to compute the
corresponding output channel from the input channels. In other words,
it uses the first 3-channel filter to calculate the first channel of
the output, and so on. Note that each output channel has its own bias.
Therefore, the number of biases in a convolutional layer is equal to
the number of output channels. Now, let's modify the previous code to
handle more than one output channel.
```python
import matplotlib.pyplot as plt
from scipy import misc
import numpy as np
from skimage import exposure
from math import ceil


def convolution2d(conv_input, conv_kernel, bias, strides=(1, 1), padding='same'):
    # This function takes an input (tensor) and a kernel (tensor)
    # and returns their convolution.
    # Args:
    #     conv_input: a numpy array of size [input_height, input_width, input # of channels].
    #     conv_kernel: a numpy array of size [kernel_height, kernel_width, input # of channels,
    #         output # of channels], represents the kernel of the convolutional layer's filter.
    #     bias: a numpy array of size [output # of channels], represents the bias of the
    #         convolutional layer's filter.
    #     strides: a tuple of (convolution vertical stride, convolution horizontal stride).
    #     padding: type of the padding scheme: 'same' or 'valid'.
    # Returns:
    #     a numpy array (convolution output).
    assert len(conv_kernel.shape) == 4, "The size of the kernel should be (kernel_height, kernel_width, input # of channels, output # of channels)"
    assert len(conv_input.shape) == 3, "The size of the input should be (input_height, input_width, input # of channels)"
    assert conv_kernel.shape[2] == conv_input.shape[2], "The input and the kernel should have the same depth."
    input_w, input_h = conv_input.shape[1], conv_input.shape[0]      # input width and height
    kernel_w, kernel_h = conv_kernel.shape[1], conv_kernel.shape[0]  # kernel width and height
    output_depth = conv_kernel.shape[3]

    if padding == 'same':
        output_height = int(ceil(float(input_h) / float(strides[0])))
        output_width = int(ceil(float(input_w) / float(strides[1])))
        # Calculate the number of zeros which need to be added as padding
        pad_along_height = max((output_height - 1) * strides[0] + kernel_h - input_h, 0)
        pad_along_width = max((output_width - 1) * strides[1] + kernel_w - input_w, 0)
        pad_top = pad_along_height // 2          # amount of zero padding on the top
        pad_bottom = pad_along_height - pad_top  # amount of zero padding on the bottom
        pad_left = pad_along_width // 2          # amount of zero padding on the left
        pad_right = pad_along_width - pad_left   # amount of zero padding on the right
        output = np.zeros((output_height, output_width, output_depth))  # convolution output
        # Add zero padding to the input image (explicit slice bounds avoid the
        # empty-slice bug of negative indexing when a pad amount is 0)
        image_padded = np.zeros((input_h + pad_along_height,
                                 input_w + pad_along_width, conv_input.shape[2]))
        image_padded[pad_top:pad_top + input_h, pad_left:pad_left + input_w, :] = conv_input
        for ch in range(output_depth):         # Loop over every output channel
            for x in range(output_width):      # Loop over every pixel of the output
                for y in range(output_height):
                    # element-wise multiplication of the kernel and the image patch
                    output[y, x, ch] = (conv_kernel[..., ch] *
                                        image_padded[y * strides[0]:y * strides[0] + kernel_h,
                                                     x * strides[1]:x * strides[1] + kernel_w, :]).sum() + bias[ch]
    elif padding == 'valid':
        output_height = int(ceil(float(input_h - kernel_h + 1) / float(strides[0])))
        output_width = int(ceil(float(input_w - kernel_w + 1) / float(strides[1])))
        output = np.zeros((output_height, output_width, output_depth))  # convolution output
        for ch in range(output_depth):         # Loop over every output channel
            for x in range(output_width):      # Loop over every pixel of the output
                for y in range(output_height):
                    # element-wise multiplication of the kernel and the image patch
                    output[y, x, ch] = (conv_kernel[..., ch] *
                                        conv_input[y * strides[0]:y * strides[0] + kernel_h,
                                                   x * strides[1]:x * strides[1] + kernel_w, :]).sum() + bias[ch]
    return output


# Load the image; on newer SciPy versions scipy.misc.imread has been removed,
# so you may need imageio.imread instead
img = misc.imread('image2.jpg', mode='RGB')
# The edge detection kernel
kernel1 = np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]])[..., None]
kernel1 = np.repeat(kernel1, 3, axis=2)
# The blur kernel
kernel2 = np.array([[1, 1, 1], [1, 1, 1], [1, 1, 1]])[..., None] / 9.0
kernel2 = np.repeat(kernel2, 3, axis=2)
# Stack the two 3-channel filters into a single 4D kernel
kernel = np.zeros_like(kernel1, dtype=float)[..., None]
kernel = np.repeat(kernel, 2, axis=3)
kernel[..., 0] = kernel1
kernel[..., 1] = kernel2
# Convolve the image and the kernel (cast to float to avoid uint8 overflow)
image_edges = convolution2d(img * 255.0, kernel, bias=np.array([1, 0]))
# Adjust the contrast of the first channel by applying histogram equalization
image_edges_equalized = exposure.equalize_adapthist(image_edges[..., 0] /
                                                    np.max(np.abs(image_edges[..., 0])),
                                                    clip_limit=0.03)
plt.figure(1)
# Plot the first channel of the output
plt.subplot(221)
plt.imshow(image_edges_equalized, cmap=plt.cm.gray)
plt.axis('off')
# Plot the second channel of the output
plt.subplot(222)
plt.imshow(image_edges[..., 1], cmap=plt.cm.gray)
plt.axis('off')
# Plot the input
plt.subplot(223)
plt.imshow(img, cmap=plt.cm.gray)
plt.axis('off')
plt.show()
```
To test the code, we created a convolutional layer with two filters: an
edge detection filter applied to all 3 channels, and a blur filter.

In brief, the **stride**, the **zero-padding**, and the **depth**
determine the size of the output of a convolutional layer. The depth is
in fact a hyperparameter, set by whoever designs the network (including
the convolutional layer), and is equal to the number of filters you
want to use. Each filter is meant to learn a different property or
aspect of the image.
Even though we have almost covered the overall operation of a
convolutional layer, we are not done yet. Similar to a fully-connected
layer, the output of a convolutional layer usually passes through an
element-wise activation function. The activation function adds
nonlinearity to the network, since a pure convolution is, from a
mathematical point of view, a linear operation. One of the most common
activation functions in deep learning is **ReLU**, which is defined as:
<p align="center"><img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/f90cc2eae2be613dcd5666264b6272b0.svg?invert_in_darkmode" align=middle width=164.2146pt height=16.438356pt/></p>
If you want to add ReLU to our latest version of the convolutional
layer, you just need to replace `return output` with
`return np.maximum(output, 0)`.
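As a tiny illustration of what `np.maximum(output, 0)` does element-wise:

```python
import numpy as np

x = np.array([[-2.0, 3.0], [0.5, -1.0]])
relu = np.maximum(x, 0)  # every negative activation is clamped to zero
print(relu)
```

All negative entries become 0 while positive entries pass through unchanged.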
Good to know
------------
Now that you know how a convolutional layer works, it's time to cover
some useful details:
- **Number of parameters:** When you are designing your network, the
  number of trainable parameters matters significantly. Therefore, it
  is good to know how many parameters your convolutional layer would
  add to your network. What you train in a convolutional layer are its
  filters and biases, so you can easily calculate its number of
  parameters using the following equation:
<p align="center"><img src="https://rawgit.com/Machinelearninguru/Image-Processing-Computer-Vision/master/svgs/38cb4271050f99f4f846a00910dde404.svg?invert_in_darkmode" align=middle width=350.88075pt height=16.438356pt/></p>
  where $d_i$ and $d_o$ are the depth (# of channels) of the input
  and the depth of the output, respectively. Note that the one inside
  the parentheses counts the biases.
- **Locally-Connected Layer:** This type of layer is quite similar to
  the convolutional layer explained in this post, but with one
  (important) difference. In the convolutional layer, the filter is
  shared among all output neurons (pixels): we use a single filter to
  calculate all neurons (pixels) of an output channel. In a
  locally-connected layer, however, each neuron (pixel) has its own
  filter. This means the number of parameters is multiplied by the
  number of output neurons, which can drastically increase the
  parameter count; if you do not have enough data, you might end up
  with an over-fitting issue. On the other hand, this type of layer
  lets your network learn different types of features for different
  regions of the input. Researchers have taken advantage of this
  helpful property of locally-connected layers, especially in face
  verification, for example in
  [DeepFace](http://www.cv-foundation.org/openaccess/content_cvpr_2014/papers/Taigman_DeepFace_Closing_the_2014_CVPR_paper.pdf)
  and [DeepID3](https://arxiv.org/abs/1502.00873). Some researchers
  instead use a distinct filter for each region of the input (rather
  than for each neuron/pixel) to get the benefit of locally-connected
  layers with fewer parameters.
- **Convolutional layers with 1×1 filter size:** Even though using a
  1×1 filter does not make sense at first glance from an image
  processing point of view, it can help by adding nonlinearity to your
  network. In fact, a 1×1 filter calculates a linear combination of
  all corresponding pixels (neurons) of the input channels and passes
  the result through an activation function, which adds the
  nonlinearity.
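A minimal NumPy sketch of this last idea (hypothetical shapes and random data, with ReLU as the activation): a 1×1 convolution is just a per-pixel linear mixing of the input channels:

```python
import numpy as np

h, w, c_in, c_out = 4, 4, 3, 2         # hypothetical feature-map sizes
x = np.random.rand(h, w, c_in)         # input feature map
weights = np.random.rand(c_in, c_out)  # one 1x1 filter per output channel
bias = np.zeros(c_out)

# For every pixel, mix the input channels linearly, then apply ReLU.
y = np.maximum(x.reshape(-1, c_in) @ weights + bias, 0).reshape(h, w, c_out)
print(y.shape)  # (4, 4, 2)
```

No spatial neighborhood is involved; each output pixel depends only on the channels of the same input pixel, which is exactly why the 1×1 layer acts as a channel-wise linear combination followed by a nonlinearity.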
**What Next?** In the next post we will get more familiar with
backpropagation and how to train a convolutional neural network.