Biologically inspired image representation





Christian Theriault, Nicolas Thome, Matthieu Cord

Université Pierre et Marie Curie, UPMC-Sorbonne Universités, LIP6, Paris, France

PUBLICATIONS:

1. HMAX-S: Deep scale representation for biologically inspired image categorization. Christian Theriault, Nicolas Thome, Matthieu Cord. ICIP 2011, pp. 1261-1264, ISBN: 978-1-4577-1304-0, Brussels, 11-14 Sep 2011. [pdf]

2. Extended coding and pooling in the HMAX model. Christian Theriault, Nicolas Thome, Matthieu Cord. IEEE Transactions on Image Processing, 2013 (online). [pdf]

DOWNLOAD:

[click here to download MATLAB code]

OVERVIEW:

This project extends the HMAX model, a neural network model for image classification. The model is a four-level architecture L1-L2-L3-L4: the first level L1 consists of multi-scale, multi-orientation local filters whose responses are progressively pooled into a vector signature at level L4. We improve this architecture by allowing filters at level L3 to combine multiple scales from the lower levels. The resulting L3 filters provide a better match to image structure and are thus more discriminant. We also introduce a multi-resolution spatial pooling at level L4, which encodes both local and global spatial information to produce discriminative image signatures. Classification results are reported on three image data sets: Caltech101, Caltech256 and Fifteen Scenes. We show significant improvements over previous architectures using a similar framework.


[Figure: network overview]

BASIC NETWORK OPERATIONS:

Layer 1 takes the convolution (denoted ∗) of the input image I(x,y) with a set of spatial Gabor filters g_{θ,σ}(x,y) at orientations θ ∈ {θ_1, θ_2, …, θ_Θ} and scales σ ∈ {σ_1, σ_2, …, σ_S}. The Gabor filter parameters can be chosen to model recordings of simple-cell activations in area V1 of the visual cortex. The operation maps the image space (x,y) to a higher-dimensional space: if the image is a real array I ∈ ℝ^{m×n}, then layer 1 is a four-dimensional array L1 ∈ ℝ^{m×n×Θ×S}.

ƒ_1 : ℝ^{m×n} → ℝ^{m×n×Θ×S},   I ↦ [L1_{1,1}, …, L1_{Θ,S}]

where L1_{θ,σ} = g_{θ,σ} ∗ I.
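To make the mapping concrete, here is a minimal NumPy/SciPy sketch of this layer (the released implementation is in MATLAB; this is not it, and the Gabor parameterization below, filter size, wavelength coupling and aspect ratio, is an illustrative assumption rather than the paper's settings):

    import numpy as np
    from scipy.signal import convolve2d

    def gabor(size, theta, sigma, gamma=0.5):
        """Real-valued Gabor filter; parameter choices are illustrative."""
        wavelength = 2.0 * sigma              # assumed scale/wavelength coupling
        half = size // 2
        y, x = np.mgrid[-half:half + 1, -half:half + 1]
        xr = x * np.cos(theta) + y * np.sin(theta)
        yr = -x * np.sin(theta) + y * np.cos(theta)
        g = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2)) \
            * np.cos(2 * np.pi * xr / wavelength)
        return g - g.mean()                   # zero mean, as usual for V1-like filters

    def layer1(image, thetas, sigmas, size=11):
        """Map an m×n image to the m×n×Θ×S array L1 of Gabor responses."""
        m, n = image.shape
        L1 = np.empty((m, n, len(thetas), len(sigmas)))
        for i, theta in enumerate(thetas):
            for j, sigma in enumerate(sigmas):
                L1[:, :, i, j] = convolve2d(image, gabor(size, theta, sigma),
                                            mode='same', boundary='symm')
        return L1

The 'same' mode keeps each response map at the image size m×n, matching the stated shape of L1.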


Layer 2 Layer L2 is obtained by applying a k×k maximum filter to L1 and downsampling the result. A well-known effect of selecting maxima over local neighborhoods is invariance to local translations, and thereby tolerance to global deformations. If L1 ∈ ℝ^{m×n×Θ×S}, the reduced layer resulting from maxima selection is L2 ∈ ℝ^{o×p×Θ×S}, where o<m and p<n.

ƒ_2 : ℝ^{m×n×Θ×S} → ℝ^{o×p×Θ×S},   [L1_{1,1}, …, L1_{Θ,S}] ↦ [L2_{1,1}, …, L2_{Θ,S}]

where L2_{θ,σ} = max_{k×k}(L1_{θ,σ}).
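A sketch of this pooling step under the same assumptions; the window size k and the downsampling stride below are illustrative choices, not the paper's grid:

    from scipy.ndimage import maximum_filter

    def layer2(L1, k=8, stride=4):
        """Local k×k maximum over each orientation/scale map, then downsample."""
        pooled = maximum_filter(L1, size=(k, k, 1, 1))  # spatial max only
        return pooled[::stride, ::stride, :, :]         # o×p×Θ×S with o<m, p<n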

Layer 3 For this layer we define a new set of filters F_j ∈ ℝ^{a×b×Θ×s}, j ∈ {1, …, N}, with a<o and b<p. Layer L3^j is generated by convolving L2 with filter F_j at all valid positions. Note that each filter spans all Θ orientations but covers only a subset of s<S scales. Since the convolution is taken at valid positions only (i.e., no padding), if L2 ∈ ℝ^{o×p×Θ×S} then L3^j ∈ ℝ^{q×r×u} with q=o−a, r=p−b and u=S−s.

ƒ_3^j : ℝ^{o×p×Θ×S} → ℝ^{q×r×u},   [L2_{1,1}, …, L2_{Θ,S}] ↦ [L3^j_1, …, L3^j_u]

where L3^j_σ = F_j ∗ L2_σ, the response of filter F_j over all Θ orientations and the band of s scales starting at σ.
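A naive sketch of one L3 map that reproduces the dimensions stated above (q = o − a, r = p − b, u = S − s); the inner product is a stand-in for the paper's exact filter response, and the explicit loops favor clarity over speed:

    import numpy as np

    def layer3(L2, Fj):
        """Match template Fj (a×b×Θ×s) against L2 at all valid positions/scales."""
        o, p, Theta, S = L2.shape
        a, b, _, s = Fj.shape
        q, r, u = o - a, p - b, S - s
        L3j = np.empty((q, r, u))
        for x in range(q):
            for y in range(r):
                for sc in range(u):
                    patch = L2[x:x + a, y:y + b, :, sc:sc + s]
                    L3j[x, y, sc] = np.sum(patch * Fj)  # linear template response
        return L3j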

Layer 4 This layer generates the final image signature by selecting the maximum output of each filter Fj across positions (x,y) and across scales σ.

ƒ_4^j : ℝ^{q×r×u} → ℝ,   [L3^j_1, …, L3^j_u] ↦ L4^j

L4^j = max_{σ,x,y} L3^j_σ(x,y)

This is done for all filters F_j, j ∈ {1, …, N}, generating the final signature in ℝ^N:

signature = [L4^1, L4^2, …, L4^N]
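Reusing layer3() from the sketch above, the last step is a one-liner; `filters` is a hypothetical list holding the N templates F_j:

    import numpy as np

    def image_signature(L2, filters):
        """Collapse each L3^j map by a global max over (x, y, σ); stack into ℝ^N."""
        return np.array([layer3(L2, Fj).max() for Fj in filters])

Under these assumptions the full pipeline reads signature = image_signature(layer2(layer1(I, thetas, sigmas)), filters), and the resulting vector can be fed to any standard classifier.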

RESULTS:

Classification results in average accuracy (%) on the Caltech101 image set

                                            15 images       30 images
Our model
  s=1                                       53              59
  + normalized dot product                  58.17±0.48      63.00±0.9
  s=1∪7                                     59.21±0.18      66.85±1.05
  + multiresolution pooling                 60.1±0.5        69.52±0.39
  + pixel level gradient                    68.49±0.75      76.32±0.97
Deep biologically inspired architectures
  Serre et al. [1]                          35              42
  Mutch & Lowe [2]                          48              54
  Huang et al. [3]                          49.8±1.25       --
  Theriault et al. [4]                      54.0±0.5        61±0.5
  LeCun et al. [5]                          --              54±1.0
  Lee et al. [6]                            57.7±1.5        64.5±0.5
  Jarrett et al. [7]                        --              65.6±1.0
  Zeiler et al. [8]                         58.6±0.7        66.9±1.1
  Fidler et al. [9]                         60.5            66.5
  Zeiler et al. [10]                        --              71±1.0
BoW architectures
  Lazebnik et al. [11]                      56.4            64.6±0.7
  Zhang et al. [12]                         59.1±0.6        62.2±0.5
  Wang et al. [13]                          64.43           73.44
  Yang et al. [14]                          67.0±0.5        73.2±0.5
  Boureau et al. [15]                       --              75.7±1.1
  Sohn et al. [16]                          --              77.8



Classification results in average accuracy (%) on the Caltech256 image set

                                            30 images
Our model (s=1∪7)
  + multiresolution pooling                 31.23±0.38
  + pixel level gradient                    40.56±0.28
Deep biologically inspired architectures
  Zeiler et al. [10]                        33.2±0.8
BoW architectures
  Yang et al. [14]                          34.02±0.35
  Wang et al. [13]                          41.19
  Boureau et al. [15]                       41.7±0.8



Classification results in average accuracy (%) on the Fifteen Scenes image set

                                            30 images
Our model (s=1∪7)
  + multiresolution pooling                 74.35±0.83
  + pixel level gradient                    82.94±0.57
Deep biologically inspired architectures
  Mutch & Lowe [2]                          63.5
  Serre et al. [1]                          53.0
BoW architectures
  Lazebnik et al. [11]                      81.4±0.45
  Yang et al. [14]                          80.4±0.45
  Boureau et al. [15]                       84.3±0.45

REFERENCES: