Abstract
In this paper, I investigate conceptual categories derived from developmental processing in a deep neural network. The similarity matrices of the deep representation at each layer of the neural network are computed and compared with those of the raw representation. While the clusters generated by the raw representation stand at the basic level of abstraction, the conceptual categories obtained from the deep representation show a bottom-up transition procedure. The results demonstrate a developmental course of learning from a specific to a general level of abstraction through the learned layers of representations in a deep belief network.
Introduction
Deep learning is a state-of-the-art technique in machine learning that is used for training large-scale artificial neural networks with many layers of neurons and millions of connections. It has been successfully applied to many real-world problems, yielding results that are comparable or even superior to those of other methods that are carefully designed for a specific purpose. Recent applications range from speech recognition (Lal & King, 2013) to language modeling (Le, Oparin, Allauzen, Gauvain, & Yvon, 2013), natural language processing (Collobert, 2011), and image classification (Krizhevsky, Sutskever, & Hinton, 2012). Another field of study that has benefited greatly from deep learning methods is computer vision. In particular, deep learning has accelerated research in visual representation and categorization. While traditional methods focused on labeled training data, deep learning offers a powerful alternative that exploits unsupervised learning (Hinton & Salakhutdinov, 2006) for addressing image encoding and visual representation of large image datasets (e.g., Torralba, Fergus, & Weiss, 2008). Deep networks are characterized by a hierarchical architecture inspired by the organization of the primate visual system, which can be conceived as a cascade of layers with non-linear probabilistic outputs. This hierarchical generative architecture has been shown to be very useful for knowledge representation and categorization. It has been argued that deep neural networks can be considered the most promising approach for simulating cognitive abilities (Stoianov & Zorzi, 2012; Zorzi, Testolin, & Stoianov, 2013). The main goal of this research is to investigate the emergence of conceptual categories at different levels of abstraction by using representations learned with a deep belief network on a large dataset composed of semantic feature production norms.
Key questions concern the nature of categorical representations and the links that might be formed between visually similar but semantically unrelated objects. Other related challenging problems deal with object representations at different levels of inclusiveness (Sadeghi, Nadjar Araabi, & Nili Ahmadabadi, 2015) and the properties within different modalities associated with each concept. It is well known that object perception is a multi-modal experience obtained by integrating information from different input modalities such as visual, auditory, and tactile inputs (Lewkowicz & Ghazanfar, 2009). Furthermore, semantic processing may involve different levels of abstraction known as superordinate, intermediate, and subordinate. However, it is unclear how the encoding of semantic information can account for prototypical representations at each level of abstraction. Carlson, Tovar, Alink, and Kriegeskorte (2013) measured the neural activity of participants while they viewed objects at different semantic levels and found increasing abstraction in the patterns of responses. Recently, Güçlü and van Gerven (2015) found strong responses in the downstream areas of cortex for semantically similar object categories.
In the present study, I explore the hierarchical category structure that emerges from deep representation on a subset of the production frequency data from the McRae et al. study (McRae, Cree, Seidenberg, & McNorgan, 2005).
Dataset
Samples of Visual Feature Dimensions from McRae Feature Dataset.
Samples from Visual-Form-and-Surface Feature Dataset.
Method
According to recent studies, human perception is constructed through a hierarchical multi-layer structure. This finding has been a source of inspiration for solving many Artificial Intelligence problems by exploiting a non-linear feature hierarchy. Deep learning is the most powerful technique that has been developed to simulate the hierarchical representation procedure of the human brain (Hinton & Salakhutdinov, 2006). Deep neural networks typically fall into two classes, depending on whether or not they use feature localization. A prominent example of networks without feature localization is the network built from Restricted Boltzmann Machines (RBMs), which aims to reconstruct the input data. In contrast, Convolutional Neural Networks (CNNs) use local filters in order to extract simple patterns from patches of input data. Therefore, while CNNs take a local approach to carefully extract detailed information, RBM networks develop a high-level representation of input data in a global and generative manner. The current paper investigates the development of high-level categories and, to this end, an RBM network can provide a better understanding of the bigger picture of the levels of abstraction in semantic categories within the data. I focus on the semantic information hidden at each layer of a deep belief network, and I examine the relation between hierarchical feature construction and hierarchical knowledge organization. Accordingly, I use a three-layer deep belief network composed of stacked RBMs and apply it to the feature dataset that was explained in the previous section. The numbers of units in the first, second, and third layers are 100, 200, and 300, respectively; they are chosen with respect to the number of training and test examples and the obtained generalization performance. The network is trained following the practical guide provided by Hinton (2012).
Each layer is trained with the contrastive divergence algorithm, which tries to reconstruct the input values by performing Gibbs sampling. The weights of the network are adjusted using all the concepts available in the McRae dataset, with a learning rate of 0.0002. The momentum coefficient is 0.5 for the first 5 epochs and is changed to 0.9 for the remaining training epochs. Learning is performed in mini-batch mode using 6 mini-batches of 89 cases, covering all 534 concepts available in the McRae dataset. Deep learning is implemented with GPU programming (Testolin, Stoianov, De Grazia, & Zorzi, 2013). The test data contains 52 concepts and is identical to the item set used by Dilkina and Lambon Ralph (2012). Cosine similarity and correlation matrices across entities are used in order to identify similar clusters within the visual_form_and_surface dimension. The conceptual clusters are compared in two modes: conceptual clusters resulting from the raw representation and those obtained from the deep representation.
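The training procedure described above can be illustrated with a minimal NumPy sketch. It is not the GPU implementation used in this study: it assumes binary-style logistic units, one-step contrastive divergence (CD-1), small random initial weights, and random stand-in data with a hypothetical feature count. Only the learning rate, the momentum schedule, the mini-batch size, and the layer sizes are taken from the text; all function names are illustrative.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def train_rbm(data, n_hidden, epochs=20, lr=0.0002, batch_size=89, seed=0):
    """Train one RBM with one-step contrastive divergence (CD-1)."""
    rng = np.random.default_rng(seed)
    n_cases, n_visible = data.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))  # assumed initialization
    b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
    dW, db_vis, db_hid = np.zeros_like(W), np.zeros_like(b_vis), np.zeros_like(b_hid)

    for epoch in range(epochs):
        # Momentum schedule from the text: 0.5 for the first 5 epochs, then 0.9.
        momentum = 0.5 if epoch < 5 else 0.9
        order = rng.permutation(n_cases)
        for start in range(0, n_cases, batch_size):
            v0 = data[order[start:start + batch_size]]
            # Positive phase: hidden probabilities and a binary sample.
            h0_prob = sigmoid(v0 @ W + b_hid)
            h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
            # One Gibbs step: reconstruct the visibles, then recompute the hiddens.
            v1 = sigmoid(h0 @ W.T + b_vis)
            h1_prob = sigmoid(v1 @ W + b_hid)
            n = v0.shape[0]
            # Momentum-smoothed updates from the CD-1 gradient estimate.
            dW = momentum * dW + lr * (v0.T @ h0_prob - v1.T @ h1_prob) / n
            db_vis = momentum * db_vis + lr * (v0 - v1).mean(axis=0)
            db_hid = momentum * db_hid + lr * (h0_prob - h1_prob).mean(axis=0)
            W += dW
            b_vis += db_vis
            b_hid += db_hid
    return W, b_vis, b_hid


def train_dbn(data, layer_sizes=(100, 200, 300), **rbm_kwargs):
    """Greedy layer-wise training: each RBM is trained on the output of the previous one."""
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b_vis, b_hid = train_rbm(x, n_hidden, **rbm_kwargs)
        layers.append((W, b_vis, b_hid))
        x = sigmoid(x @ W + b_hid)
    return layers


def encode(data, layers):
    """Return the hidden representation of the data at every layer."""
    reps, x = [], data
    for W, _, b_hid in layers:
        x = sigmoid(x @ W + b_hid)
        reps.append(x)
    return reps


# Stand-in data: 534 concepts x 60 features in [0, 1] (the feature count is hypothetical).
X = np.random.default_rng(1).random((534, 60))
layers = train_dbn(X, epochs=10)
for i, rep in enumerate(encode(X, layers), start=1):
    print(f"layer {i}: {rep.shape}")
```

With 534 cases and a batch size of 89, each epoch visits exactly 6 mini-batches, matching the setup in the text. The per-layer outputs of `encode` are the deep representations on which the similarity analysis would be run.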
Results
As explained earlier in the Dataset section, visual features in the McRae dataset come in three dimensions. I decided to base all my computations on form_and_surface features since this dimension accounts for the main visual differences. I then studied the visual features in three modes: {form_and_surface}, {form_and_surface+color}, and {form_and_surface+color+motion}. The motivation for this approach is that color and motion information are inefficient at revealing the inherent semantic distinctions. Color information cannot serve as a reliable basis for semantic categorization since artificial and natural objects can be found in a variety of colors. In addition, motion information is biased towards the capability to move and so makes only a broad distinction between moving objects (i.e., animals plus vehicles) and non-moving objects. Therefore, I added the color and motion dimensions successively to examine whether they could assist in providing a clearer understanding of the organization of semantic categories. However, adding these dimensions did not change the main observation, and the obtained results demonstrated that the form_and_surface dimension accounts for the clearest hierarchical structure. Hence, the following subsections focus solely on the results related to the form_and_surface dimension.
Correlation Matrices
One classical approach to studying semantic relationships is to use the correlations and similarities between the property vectors corresponding to each concept. This technique has been used to assess the organization and structure of conceptual categories (Dilkina & Lambon Ralph, 2012; Sadeghi, McClelland, & Hoffman, 2014). First, the similarity matrix between all pairs of items is calculated, and then the correlations are calculated and visualized. This procedure is repeated at each deep layer. The corresponding results for the visual form_and_surface dimension are depicted in Figure 1. The plots are shown for the raw and deep representations. The raw representation refers to the results obtained directly from the normalized features as provided by McRae, whereas the deep representation refers to the representation obtained through deep learning. As can be seen from the figures, raw features generate clusters at the basic level of abstraction. In contrast, the deep representation reveals a developmental transition between different levels of abstraction, from specific categories to more general concepts. Most notably, in deep mode, the number of observed clusters decreases from the first layer towards the third layer. In addition, category formation follows a developmental structure, i.e., the categories start at the subordinate level of abstraction at the first layer and end in superordinate categories at the third layer.
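As a minimal sketch of this two-step procedure, the following NumPy code first computes a cosine similarity matrix between all pairs of items and then correlates the rows of that matrix. Correlating the similarity profiles of items is one plausible reading of the procedure and is an assumption here, as are the function names and the random stand-in data.

```python
import numpy as np


def cosine_similarity_matrix(X):
    """Cosine similarity between all pairs of rows (items) of X."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero rows at zero similarity
    Xn = X / norms
    return Xn @ Xn.T


def correlation_of_similarities(X):
    """Pearson correlation between the similarity profiles of the items."""
    S = cosine_similarity_matrix(X)
    return np.corrcoef(S)  # rows of S are treated as variables


# Stand-in data: 52 items x 40 features (both sizes are illustrative only).
features = np.random.default_rng(0).random((52, 40))
C = correlation_of_similarities(features)
print(C.shape)
```

The same two functions can be applied unchanged to the raw feature vectors and to the hidden activations of each layer, which yields one item-by-item matrix per representation.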
Figure 1. Correlation plots corresponding to {form_and_surface} features for (a) raw representation, (b) deep representation at layer 1, (c) deep representation at layer 2, and (d) deep representation at layer 3 (see online version for color figure).
As mentioned earlier, the visual form_and_surface dimension includes information about the visual appearance of objects. The categories detectable through this type of feature at the first deep layer (as can be inferred from Figure 1) are terrestrial animals, birds, fruits, vehicles, and miscellaneous artificial objects. In the second layer, birds and terrestrial animals merge and form the general category of animals. Finally, in the third layer, a more general picture is observable: on the one hand, animals form a more coherent cluster; on the other hand, fruits, vehicles, and other artificial objects are integrated together. Hence, the superordinate categories of living and non-living things are differentiated at the third layer. One exception is airplane, which is clustered together with birds. Inspecting the visual form_and_surface features of this entity shows that it has a strong connection with birds through the “wing” property, which is strongly rated for airplane.
In order to provide a better understanding of the level of inclusiveness in each deep layer of representation, a statistical analysis based on the average correlation is performed. To this end, the correlation matrices are divided into four partitions to evaluate the strength of the relationship within and between superordinate categories (i.e., living and non-living items). Accordingly, the average of the correlation values in each partition is calculated using the bootstrap method, and a t-test is performed to determine whether the difference between the average values is significant. The obtained results are presented in the correlation graph in Figure 2. As can be inferred from this graph, while the average within-category correlation coefficients go up from layer 1 towards layer 3, the average between-category correlation coefficient drops. The results illustrate that the generality of conceptual categories increases during the development of concept representations through the layers of the deep neural network. In other words, the conceptual differentiation of animal and non-animal (or living and non-living) categories is observable at higher layers of the deep representation.
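The partition analysis can be sketched as follows. This is an illustrative NumPy version under several assumptions that the text does not specify: items are resampled with replacement within each superordinate group, self-pairs are excluded from the within-group averages, and only the paired t statistic is computed (a p-value would additionally require the t distribution). The function names and the synthetic two-group data are hypothetical.

```python
import numpy as np


def bootstrap_partition_means(C, is_living, n_boot=1000, seed=0):
    """Bootstrap the mean within-group and between-group correlations."""
    rng = np.random.default_rng(seed)
    is_living = np.asarray(is_living, dtype=bool)
    living = np.flatnonzero(is_living)
    nonliving = np.flatnonzero(~is_living)
    within, between = np.empty(n_boot), np.empty(n_boot)
    for b in range(n_boot):
        a = rng.choice(living, size=living.size, replace=True)
        n = rng.choice(nonliving, size=nonliving.size, replace=True)
        vals = []
        for idx in (a, n):
            block = C[np.ix_(idx, idx)]
            distinct = idx[:, None] != idx[None, :]  # drop pairs of an item with itself
            vals.append(block[distinct])
        within[b] = np.concatenate(vals).mean()   # living-living and nonliving-nonliving
        between[b] = C[np.ix_(a, n)].mean()       # living-nonliving (symmetric partition)
    return within, between


def paired_t_statistic(x, y):
    """t statistic of a paired-samples t-test on two equally long samples."""
    d = np.asarray(x) - np.asarray(y)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))


# Synthetic example: two groups of 8 items, each scattered around its own prototype.
rng = np.random.default_rng(2)
proto_a, proto_b = rng.standard_normal(30), rng.standard_normal(30)
items = np.vstack([proto_a + 0.3 * rng.standard_normal((8, 30)),
                   proto_b + 0.3 * rng.standard_normal((8, 30))])
labels = np.array([True] * 8 + [False] * 8)
within, between = bootstrap_partition_means(np.corrcoef(items), labels, n_boot=200)
print(within.mean(), between.mean(), paired_t_statistic(within, between))
```

Running this once per representation (raw, layer 1, layer 2, layer 3) would give the within and between averages that are compared across layers.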
Figure 2. Correlation graph. Average correlation within and between the groups of animals and non-animals for the {form_and_surface} dimension using bootstrapping. Basic: raw mode, L1: deep mode-layer 1, L2: deep mode-layer 2, L3: deep mode-layer 3. A paired-samples t-test is used to assess the differences at each level (*p < .0001).
Singular Value Analysis
In this section, I apply the singular value decomposition (SVD) method to the correlation matrices obtained from the deep representation at each deep layer. This method decomposes an input matrix M into three matrices in a linear manner:
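In standard notation, the decomposition can be written as below. The symbols U, Σ, V, and σ_i are the conventional ones and are assumptions here, since only M is named in the text.

```latex
\[
M = U \, \Sigma \, V^{\top},
\qquad
U^{\top} U = I, \quad V^{\top} V = I,
\qquad
\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_n),
\quad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_n \ge 0 .
\]
```

The columns of U and V are the left and right singular vectors, and the singular values on the diagonal of Σ are ordered from largest to smallest, so the first few of them capture the dominant structure of M.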
The first three singular values obtained from applying SVD to the third layer of deep representation based on {form_and_surface} features.
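As a hedged illustration of how such leading singular values could be extracted, the sketch below applies NumPy's `np.linalg.svd` to a correlation matrix. The matrix here is built from random stand-in data, not from the deep representation analyzed in this study, and the function name is illustrative.

```python
import numpy as np


def leading_singular_values(M, k=3):
    """Return the k largest singular values of M and the matching left singular vectors."""
    U, s, Vt = np.linalg.svd(M)  # s is returned in descending order
    return s[:k], U[:, :k]


# Stand-in correlation matrix for 52 items (random data, illustrative only).
activations = np.random.default_rng(0).standard_normal((52, 300))
M = np.corrcoef(activations)
values, vectors = leading_singular_values(M)
print(values)
```

The columns of `vectors` indicate how strongly each item loads on the corresponding singular dimension, which is the information one would inspect to see which items a dimension separates.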

Discussion and Conclusion
Hierarchical representation offers a number of advantages, such as generalization, efficient computational processing and memory use, and the ability to produce more complex information (Kruger et al., 2013). There is evidence that the mammalian visual system is composed of different regions that are arranged in a hierarchical organization in terms of complexity. The hierarchical structure of receptive fields in the visual cortex was first identified by Hubel and Wiesel (1962). Their findings indicated that neurons along the ventral stream are arranged in a simple-to-complex structure. This pioneering work has been regarded as the principle behind the design of biologically plausible learning methods that mimic the structure of the visual cortex (Fukushima, 1980; Marr, 1982; Serre, Wolf, Bileschi, Riesenhuber, & Poggio, 2007). In particular, deep learning can be considered the most successful attempt to emulate the learning of hierarchical representations in the brain. Deep neural networks leverage a hierarchical architecture composed of multiple layers of non-linear processing that produce complex features by combining low-level information. The multi-layer structure of deep networks allows them to learn a distributed representation of the input data. The resulting representation becomes more general layer by layer. In this sense, higher layers encompass a more abstract code of the input data. The question addressed in this research is the relation between levels of representation and levels of abstraction. To this end, I used a deep belief network composed of three stacked Restricted Boltzmann Machines. Each RBM creates a generative model of the data by using the joint energy distribution of the visible and hidden layers (Hinton, 2012). The output of each RBM serves as the input for the next RBM in the sequence of layers.
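For reference, the energy-based formulation of a binary RBM can be written in its standard form as below. The symbols (visible units v_i, hidden units h_j, biases a_i and b_j, weights w_ij, partition function Z) follow common convention and are assumptions, not notation taken from this paper.

```latex
\[
E(v, h) = -\sum_{i} a_i v_i - \sum_{j} b_j h_j - \sum_{i,j} v_i \, w_{ij} \, h_j ,
\qquad
p(v, h) = \frac{1}{Z} \, e^{-E(v, h)},
\qquad
Z = \sum_{v, h} e^{-E(v, h)} .
\]
\[
p(h_j = 1 \mid v) = \operatorname{logistic}\Big( b_j + \sum_{i} v_i \, w_{ij} \Big),
\qquad
p(v_i = 1 \mid h) = \operatorname{logistic}\Big( a_i + \sum_{j} w_{ij} \, h_j \Big).
\]
```

Because there are no connections within a layer, the hidden units are conditionally independent given the visible units and vice versa, which is what makes the alternating Gibbs sampling used in contrastive divergence tractable. The hidden probabilities of one RBM are what is passed on as input to the next.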
By using greedy layer-by-layer pre-training of RBMs, the deep belief network arrives at a rich representation of the distribution of the data. In the current paper, I study the transition of representations across the layers of a deep belief network and its association with taxonomic categories. Many studies have shown that infants are much better at recognizing more inclusive categories than items from less inclusive sets (Mandler & McDonough, 1998; Pauen, 2002). This may support the hypothesis that more abstract information is developed faster and that attention to fine distinctions emerges later. Saxe, McClelland, and Ganguli (2013) analyzed the learning dynamics of a deep neural network and found that highly abstract concepts emerge before less abstract ones during the course of supervised learning. In this paper, I argue that once a deep belief network has been trained in an unsupervised manner, the more abstract conceptual categories are present in the higher layers. In particular, this study is concerned with the conceptual categories that arise at the layers of hierarchical neural networks. Using a three-layer deep belief network and similarity analysis, I demonstrated the fine-to-coarse structure of developmental learning from the deep representation of behavioral features of concepts. My investigation shows that deep learning follows a bottom-up approach to reach the taxonomic levels of concepts, and that the most abstract concepts are present in the higher layers of representation. In other words, the results indicate that the most general distinctions are represented most saliently in the highest layers of the deep network; hence there is a progression in depth, where low-level layers represent finer distinctions and high-level layers represent coarser distinctions.
Acknowledgements
The author would like to thank Professor James L. McClelland for all his support and Dr. Andrew M. Saxe for his useful comments.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
