This text comes from a page of the UCL psychology department; I have not altered it in any way and am only saving a copy here for easy reference.
The problem of object recognition
We need to be able to recognise an object from a variety of viewpoints (and in a variety of lighting conditions) even though the retinal images generated by the object can be markedly different. We need to be able to generalise across different instances of an object while at the same time distinguishing between instances. We need to distinguish between allowable and invalid transformations of an object. We need to encode knowledge of how objects can be decomposed into functional units (parts).
Marr and Nishihara’s criteria
How should we evaluate a theory of object perception? Marr and Nishihara set out 5 criteria against which models of object perception should be evaluated.
Accessibility: the ease with which an object description can be derived from image data.
Scope: the range of objects to which the model applies.
Uniqueness: the same object should always result in the same unique description.
Stability: the description should be stable with respect to minor changes in the object.
Sensitivity: the description should allow discrimination between instances.
Any model will need to strike a compromise between the latter two criteria: a description stable enough to generalise must not lose the sensitivity needed to discriminate between instances.
Templates, Features and Structural Descriptions
The simplest model of recognition is that we hold image-based templates in memory and that recognition results from the activation of the template corresponding to the recognised object. However, it is unlikely that the brain could hold enough templates to cover all possible objects, and this theory does not explain how we can recognise objects we have never seen before. Breaking the object up into features reduces the problem. Selfridge’s (1959) Pandemonium system provides a good example: simple feature detectors provide evidence, which is passed on to more complex feature detectors, and the output is the member of the class of objects best supported by the evidence in the image. However, we also need to encode the relationships between the elements of the object. A specification of the features of an object together with the relationships between them is called a structural description.
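The evidence-accumulation idea behind Pandemonium can be sketched as follows. This is my own minimal illustration, not Selfridge's actual system: the feature names, weights and letters are invented for the example. Feature "demons" report evidence, a "cognitive demon" for each candidate letter sums the evidence for the features it expects, and a "decision demon" picks the loudest shout.

```python
# Hypothetical feature evidence extracted from an image of a letter.
features = {"vertical_bar": 1.0, "horizontal_bar": 1.0, "oblique": 0.0}

# Hypothetical feature weights for a few candidate letters.
letter_templates = {
    "L": {"vertical_bar": 1.0, "horizontal_bar": 1.0},
    "T": {"vertical_bar": 1.0, "horizontal_bar": 1.0},
    "V": {"oblique": 2.0},
}

def decision_demon(features, templates):
    """Return the letter whose cognitive demon shouts loudest."""
    shouts = {
        letter: sum(weight * features.get(f, 0.0)
                    for f, weight in expected.items())
        for letter, expected in templates.items()
    }
    return max(shouts, key=shouts.get)

print(decision_demon(features, letter_templates))
```

Note that "L" and "T" receive identical evidence here, because both contain a vertical and a horizontal bar; only the spatial relationship between the bars distinguishes them. This is exactly the limitation the text raises, and the motivation for structural descriptions.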
Marr and Nishihara’s approach
Marr and Nishihara (1978) considered what kind of coordinate frame might be used to solve the problem of object constancy. They argued for an object-centred rather than a viewer-centred coordinate frame: a description based on an object-centred frame is unaffected by viewpoint. A modular, hierarchical system allows for both generalisation (stability) and discrimination (sensitivity) by allowing different levels of detail in the description. The process of description requires the hierarchical decomposition of the object into a set of articulated parts, each of which has its own axis and a point of contact with the principal axis. Limits on the angles between components can encode constraints on articulation, and variation in the lengths of the axes can encode a variety of object types, e.g. varieties of animals. The primitives (elements) of the description are volumetric elements called generalised cones. Recognition proceeds by recovering a description from the image, using that description to index a store of models, and finally arriving at a description that incorporates both information from the image and the constraints encoded in the model.
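The hierarchical part decomposition can be made concrete with a small data-structure sketch. This is my own illustration of the idea, not Marr and Nishihara's notation: each part carries its own axis, an attachment point and angle relative to its parent's principal axis, an articulation constraint, and optional sub-parts giving a finer level of detail. The example figure and all its numbers are invented.

```python
from dataclasses import dataclass, field

@dataclass
class Part:
    name: str
    axis_length: float              # relative to the parent axis (1.0 = same length)
    attach_point: float             # position along parent axis, 0.0..1.0
    angle: float                    # degrees between this axis and the parent's
    angle_range: tuple = (0.0, 180.0)   # articulation constraint on the joint
    parts: list = field(default_factory=list)

def describe(part, depth=0):
    """Print the hierarchy; the coarse description comes first, detail below it."""
    print("  " * depth + f"{part.name}: length {part.axis_length}, "
          f"attached at {part.attach_point}, angle {part.angle} deg")
    for sub in part.parts:
        describe(sub, depth + 1)

# A crude stick-figure "human" model, decomposed around the torso's principal axis.
human = Part("torso", 1.0, 0.0, 0.0, parts=[
    Part("arm", 0.6, 0.9, 40.0, angle_range=(0.0, 160.0), parts=[
        Part("forearm", 0.5, 1.0, 20.0),
    ]),
    Part("leg", 0.9, 0.0, 170.0),
])

describe(human)
```

Because lengths and angles are stored relative to the parent axis, the same description survives changes of viewpoint, and varying the axis lengths alone yields related object types, as the text describes.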
Biederman (1987), like Marr and Nishihara, proposes that objects are broken up into parts based on geometrical properties of occluding contours in the image, in particular that parts are defined in relation to sharp concavities on contours. The parts are geometric primitives called geons (geometric ions), which include wedges, cylinders and cones. Again, objects are represented as a structural description based on these geometric primitives. According to Biederman, the primitives are defined by properties such as collinearity, symmetry and parallelism, which are non-accidental properties in that they remain invariant in the image under changes in viewpoint. For Biederman, recognition proceeds directly from image properties without any explicit representation of 3D shape.
Evidence comes from experiments in which parts of a line drawing of an object are obscured. If enough information remains for the geons to be identified, the object is recognised more easily than when the geons themselves are obscured. There are problems with the account: it is not clear that sufficiently clean contour information is available in real images; object segmentation may depend on learning rather than geometry; and it may be difficult to represent differences between individual instances of an object.
Is object constancy based on generating a view-independent code, or on storing multiple views of an object? The way to investigate this question is to train observers on a limited set of views of a novel object and then test recognition at a different viewpoint. Bülthoff and Edelman showed observers views of computer-generated “paper clips” taken from two viewpoints. Recognition performance was good for novel views that lay between the learned views (interpolation), less good for views on the same axis but outside the training range (extrapolation), and worst for views taken from points along a different axis. They concluded that we store canonical views of objects.
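The view-storage alternative can be illustrated with a toy sketch. This is not Bülthoff and Edelman's actual model: the feature function below is an invented stand-in for the image features visible from a viewpoint, and the scheme shown is simple nearest-stored-view matching. It only illustrates why test views near the training views are matched more easily than distant ones.

```python
import math

def view_features(angle_deg):
    # Hypothetical stand-in for the image features visible from one viewpoint.
    a = math.radians(angle_deg)
    return (math.cos(a), math.sin(a))

# "Train" by storing two views of the same object, 75 degrees apart.
stored_views = [view_features(0.0), view_features(75.0)]

def match_cost(test_angle):
    """Distance from a test view to the nearest stored view (lower = easier match)."""
    test = view_features(test_angle)
    return min(math.dist(test, v) for v in stored_views)

# A view between the trained views lies closer to a stored view than a
# view well outside the training range.
print(match_cost(40.0))   # interpolated view
print(match_cost(150.0))  # extrapolated view
```

A richer account (e.g. interpolating between stored views rather than picking the nearest one) is needed to capture the full ordering of interpolation, same-axis extrapolation and different-axis views reported in the text.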
Face perception is concerned with how we recognise individual instances of a type of object. Faces are interesting because they undergo non-rigid (expressions, gestures) as well as rigid transformations and they are used as a channel of communication.
Faces do not show invariance under simple transformations such as image inversion (Yin, 1969). A number of studies have shown that upside-down faces are difficult to recognise, which is likely to reflect our extensive experience with upright faces. Diamond and Carey (1986) tested the ability of expert dog-show judges and novices to recognise individual dogs of a particular pedigree breed and found that the experts were more affected by inversion than the novices.
Faces are also difficult to recognise when presented as photographic negatives. This suggests that we do not recognise faces on the basis of image contours, as Biederman suggested for objects, since negation leaves the position of object contours and the location of features intact. The effects of negation are instead due to its disrupting the shading information in the face and altering its pigmented regions; for example, dark hair and eyes become light after negation.
Face recognition seems to rely on encoding the configuration of features rather than identifying individual features. This is most clearly seen when feature identification in upright and inverted faces is compared. Young, Hellawell and Hay (1987) compared recognition of the parts of facial composites: the halves of a composite were identified more quickly when the composite was inverted than when it was presented the right way up, suggesting that an upright composite fuses into a new configuration that masks its parts. The Margaret Thatcher illusion (Thompson, 1980) also indicates that features are encoded independently in upside-down faces.