Biologically inspired image representation
Christian Theriault   ,   Nicolas Thome   ,   Matthieu Cord
Universite Pierre et Marie Curie, UPMC-Sorbonne Universities, LIP6, Paris, France
PUBLICATIONS:
1. HMAX-S: Deep scale representation for biologically inspired image categorization. [pdf]
Christian Theriault, Nicolas Thome, Matthieu Cord. ICIP 2011, pp. 1261-1264, ISBN 978-1-4577-1304-0, Brussels, 11-14 Sep 2011.
2. Extended coding and pooling in the HMAX model. [pdf]
Christian Theriault, Nicolas Thome, Matthieu Cord. IEEE Transactions on Image Processing, 2013 (now online).
DOWNLOAD:
[click here to download MATLAB code]
OVERVIEW:
This project is an extension of the HMAX model, a neural network model for image classification. The model can be described as a four-level architecture L1 → L2 → L3 → L4, whose first level L1 consists of multi-scale, multi-orientation local filters that are progressively pooled into a vector signature at level L4. We improve this architecture by allowing filters at level L3 to combine multiple scales from the lower levels. The resulting L3 filters provide a better match to image structure and are thus more discriminative. We also introduce a multi-resolution spatial pooling at level L4, which encodes both local and global spatial information to produce discriminative image signatures. Classification results are reported on three image data sets: Caltech101, Caltech256 and Fifteen Scenes. We show significant improvements over previous architectures using a similar framework.
BASIC NETWORK OPERATIONS:
Layer 1 takes the convolution (noted $\ast$) of the input image $I(x,y)$ with a set of spatial Gabor filters $g_{\theta,\sigma}(x,y)$ with orientations $\theta \in \{\theta_1, \theta_2, \dots, \theta_\Theta\}$ and scales $\sigma \in \{\sigma_1, \sigma_2, \dots, \sigma_S\}$. The Gabor filter parameters can be chosen to model recordings of simple-cell activations in area V1 of the visual cortex. The operation maps the image space $(x,y)$ to a higher-dimensional space: if the image is a real array $I \in \mathbb{R}^{m \times n}$, then layer 1 is a four-dimensional array $L1 \in \mathbb{R}^{m \times n \times \Theta \times S}$,

$f_1 : \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m \times n \times \Theta \times S}, \qquad L1_{\theta,\sigma} = g_{\theta,\sigma} \ast I.$
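For illustration, here is a minimal MATLAB sketch of this first layer. It is not the released code: the image file name, the number of orientations and scales, the Gaussian aspect ratio, and the wavelength choice lambda = 2*sigma are illustrative assumptions, not the parameters used in the model.

% Minimal sketch of layer L1: multi-scale, multi-orientation Gabor filtering.
% 'example.jpg' is a placeholder for any input image.
I = double(imread('example.jpg'));
if size(I,3) == 3, I = mean(I,3); end          % collapse RGB to grayscale
[m, n]  = size(I);
thetas  = (0:3) * pi/4;                        % Theta = 4 orientations (illustrative)
sigmas  = [2 4 6 8];                           % S = 4 scales, in pixels (illustrative)
L1 = zeros(m, n, numel(thetas), numel(sigmas));
for si = 1:numel(sigmas)
    sigma  = sigmas(si);
    lambda = 2*sigma;                          % wavelength tied to scale (assumption)
    half   = ceil(3*sigma);                    % filter support
    [x, y] = meshgrid(-half:half, -half:half);
    for ti = 1:numel(thetas)
        theta = thetas(ti);
        xr =  x*cos(theta) + y*sin(theta);     % rotate coordinates to orientation theta
        yr = -x*sin(theta) + y*cos(theta);
        g  = exp(-(xr.^2 + 0.25*yr.^2)/(2*sigma^2)) .* cos(2*pi*xr/lambda);
        g  = g - mean(g(:));                   % zero-mean Gabor filter
        L1(:,:,ti,si) = conv2(I, g, 'same');   % L1_{theta,sigma} = g * I
    end
end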
Layer 2 is obtained by applying a $k \times k$ maximum filter to $L1$ and downsampling the result. A well-known effect of selecting maxima over local neighborhoods is invariance to local translations, and thereby to global deformations. If $L1 \in \mathbb{R}^{m \times n \times \Theta \times S}$, the reduced layer resulting from maxima selection is $L2 \in \mathbb{R}^{o \times p \times \Theta \times S}$ with $o < m$ and $p < n$,

$f_2 : \mathbb{R}^{m \times n \times \Theta \times S} \rightarrow \mathbb{R}^{o \times p \times \Theta \times S}, \qquad L2_{\theta,\sigma} = \max_{k \times k}\big(L1_{\theta,\sigma}\big).$
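As an illustration, a minimal MATLAB sketch of this pooling follows, assuming the local maximum is combined with a stride that performs the downsampling; the neighborhood size k and the stride are illustrative values, not the model's.

% Minimal sketch of layer L2: local k-by-k maximum followed by downsampling,
% applied independently to every orientation/scale map of L1.
k      = 8;                                    % pooling neighborhood (illustrative)
stride = 4;                                    % downsampling step (illustrative)
[m, n, Theta, S] = size(L1);
rows = 1:stride:(m-k+1);
cols = 1:stride:(n-k+1);
L2 = zeros(numel(rows), numel(cols), Theta, S);
for si = 1:S
    for ti = 1:Theta
        for i = 1:numel(rows)
            for j = 1:numel(cols)
                patch = L1(rows(i):rows(i)+k-1, cols(j):cols(j)+k-1, ti, si);
                L2(i,j,ti,si) = max(patch(:));     % maximum over the local neighborhood
            end
        end
    end
end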
Layer 3 uses a new set of filters $F_j \in \mathbb{R}^{a \times b \times \Theta \times s}$, $j \in \{1, \dots, N\}$, with $a < o$ and $b < p$. Layer $L3_j$ is generated by the convolution of $L2$ with filter $F_j$ at all valid positions. Note that each filter spans all $\Theta$ orientations but covers only a subset of $s < S$ scales. Since the convolution is taken at valid positions (i.e. no padding), if $L2 \in \mathbb{R}^{o \times p \times \Theta \times S}$ then $L3_j \in \mathbb{R}^{q \times r \times u}$ with $q = o - a$, $r = p - b$, $u = S - s$,

$f_3^{\,j} : \mathbb{R}^{o \times p \times \Theta \times S} \rightarrow \mathbb{R}^{q \times r \times u}, \qquad L3_j = F_j \ast L2.$
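A minimal MATLAB sketch of this layer for a single filter F_j is given below. The filter size (a, b, s) is an illustrative choice and a random filter stands in for the model's actual filters; the convolution is written out as a sliding inner product over all valid positions and contiguous scale bands.

% Minimal sketch of layer L3: slide one filter Fj over L2 across positions and
% over contiguous bands of s scales, keeping all Theta orientations.
a = 4; b = 4; s = 2;                           % filter size (illustrative)
[o, p, Theta, S] = size(L2);
Fj  = randn(a, b, Theta, s);                   % random stand-in for a real filter
L3j = zeros(o-a+1, p-b+1, S-s+1);
for sb = 1:(S-s+1)                             % starting scale of the band
    for i = 1:(o-a+1)
        for j = 1:(p-b+1)
            patch = L2(i:i+a-1, j:j+b-1, :, sb:sb+s-1);
            L3j(i,j,sb) = sum(patch(:) .* Fj(:));  % sliding inner product (valid, no padding)
        end
    end
end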
Layer 4 generates the final image signature by selecting the maximum output of each filter $F_j$ across positions $(x,y)$ and across scales $\sigma$,

$f_4^{\,j} : \mathbb{R}^{q \times r \times u} \rightarrow \mathbb{R}, \qquad L4_j = \max_{\sigma,x,y} L3_j^{\sigma}(x,y).$

This is done for all filters $F_j$, $j \in \{1, \dots, N\}$, to generate a final signature in $\mathbb{R}^N$.
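A sketch of this final pooling, reusing the L3 computation from the previous block; the helper function layer3 and the cell array filters are hypothetical names introduced only for this illustration.

% Minimal sketch of layer L4: one scalar per filter, the maximum of the L3_j
% maps over all positions (x,y) and scales sigma.
% 'filters' is a hypothetical cell array holding F_1..F_N, and layer3() is a
% hypothetical helper wrapping the L3 computation sketched above.
N = numel(filters);
signature = zeros(1, N);                       % final image signature in R^N
for jj = 1:N
    L3j = layer3(L2, filters{jj});             % position/scale maps of filter j
    signature(jj) = max(L3j(:));               % global max over sigma, x, y
end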
RESULTS:
Classification results in average accuracy on the Caltech101 image set

Method                                      | 15 images      | 30 images
Our model
  s=1                                       | 53             | 59
  + normalized dot product                  | 58.17 ± 0.48   | 63.00 ± 0.9
  s=1∪7                                     | 59.21 ± 0.18   | 66.85 ± 1.05
  + multiresolution pooling                 | 60.1 ± 0.5     | 69.52 ± 0.39
  + pixel level gradient                    | 68.49 ± 0.75   | 76.32 ± 0.97
Deep biologically inspired architectures
  Serre et al. [1]                          | 35             | 42
  Mutch & Lowe [2]                          | 48             | 54
  Huang et al. [3]                          | 49.8 ± 1.25    | --
  Theriault et al. [4]                      | 54.0 ± 0.5     | 61 ± 0.5
  LeCun et al. [5]                          | --             | 54 ± 1.0
  Lee et al. [6]                            | 57.7 ± 1.5     | 64.5 ± 0.5
  Jarrett et al. [7]                        | --             | 65.6 ± 1.0
  Zeiler et al. [8]                         | 58.6 ± 0.7     | 66.9 ± 1.1
  Fidler et al. [9]                         | 60.5           | 66.5
  Zeiler et al. [10]                        | --             | 71 ± 1.0
BoW architectures
  Lazebnik et al. [11]                      | 56.4           | 64.6 ± 0.7
  Zhang et al. [12]                         | 59.1 ± 0.6     | 62.2 ± 0.5
  Wang et al. [13]                          | 64.43          | 73.44
  Yang et al. [14]                          | 67.0 ± 0.5     | 73.2 ± 0.5
  Boureau et al. [15]                       | --             | 75.7 ± 1.1
  Sohn et al. [16]                          | --             | 77.8
Classification results in average accuracy on the Caltech256 image set

Method                                      | 30 images
Our model
  s=1∪7 + multiresolution pooling           | 31.23 ± 0.38
  + pixel level gradient                    | 40.56 ± 0.28
Deep biologically inspired architectures
  Zeiler et al. [10]                        | 33.2 ± 0.8
BoW architectures
  Yang et al. [14]                          | 34.02 ± 0.35
  Wang et al. [13]                          | 41.19
  Boureau et al. [15]                       | 41.7 ± 0.8
Classification results in average accuracy on the Fifteen Scenes image set

Method                                      | 30 images
Our model
  s=1∪7 + multiresolution pooling           | 74.35 ± 0.83
  + pixel level gradient                    | 82.94 ± 0.57
Deep biologically inspired architectures
  Mutch & Lowe [2]                          | 63.5
  Serre et al. [1]                          | 53.0
BoW architectures
  Lazebnik et al. [11]                      | 81.4 ± 0.45
  Yang et al. [14]                          | 80.4 ± 0.45
  Boureau et al. [15]                       | 84.3 ± 0.45
REFERENCES:
- [1] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, and T. Poggio, Robust object recognition with cortex-like mechanisms, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, pp. 411-426, 2007.
- [2] J. Mutch and D. G. Lowe, Object class recognition and localization using sparse features with limited receptive fields, Int. J. Comput. Vision, vol. 80, pp. 45-57, October 2008.
- [3] Y. Huang, K. Huang, D. Tao, T. Tan, and X. Li, Enhanced biologically inspired model for object recognition, IEEE Transactions on Systems, Man, and Cybernetics, Part B, vol. 41, no. 6, pp. 1668-1680, 2011.
- [4] C. Theriault, N. Thome, and M. Cord, HMAX-S: Deep scale representation for biologically inspired image categorization, in IEEE International Conference on Image Processing, 2011.
- [5] M. Ranzato, F. J. Huang, Y-L. Boureau, and Y. LeCun, Unsupervised learning of invariant feature hierarchies with applications to object recognition, in IEEE Conference on Computer Vision and Pattern Recognition, 2007, pp. 1-8.
- [6] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations, in Proceedings of the 26th Annual International Conference on Machine Learning (ICML '09), New York, NY, USA, 2009, pp. 609-616.
- [7] K. Jarrett, K. Kavukcuoglu, M. Ranzato, and Y. LeCun, What is the best multi-stage architecture for object recognition?, in Proc. International Conference on Computer Vision (ICCV'09), 2009.
- [8] M. Zeiler, D. Krishnan, G. Taylor, and R. Fergus, Deconvolutional networks for feature learning, in IEEE Conference on Computer Vision and Pattern Recognition, 2010.
- [9] S. Fidler, B. Boben, and A. Leonardis, Similarity-based cross-layered hierarchical representation for object categorization, in IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Alaska, USA, June 2008.
- [10] M. D. Zeiler, G. W. Taylor, and R. Fergus, Adaptive deconvolutional networks for mid and high level feature learning, in International Conference on Computer Vision, 2011, pp. 2018-2025.
- [11] S. Lazebnik, C. Schmid, and J. Ponce, Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories, in CVPR, vol. 2, pp. 2169-2178, 2006.
- [12] H. Zhang, A. C. Berg, M. Maire, and J. Malik, SVM-KNN: Discriminative nearest neighbor classification for visual category recognition, in IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 2126-2136, 2006.
- [13] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong, Locality-constrained linear coding for image classification, in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 3360-3367.
- [14] J. Yang, K. Yu, Y. Gong, and T. Huang, Linear spatial pyramid matching using sparse coding for image classification, in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
- [15] Y-L. Boureau, F. Bach, Y. LeCun, and J. Ponce, Learning mid-level features for recognition, in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 2559-2566.
- [16] K. Sohn, D. Y. Jung, H. Lee, and A. Hero III, Efficient learning of sparse, distributed, convolutional feature representations for object recognition, in Proceedings of the 13th International Conference on Computer Vision, 2011.