As a raw list of materials, the content can be viewed in the following timeline: Introduction/ goal and difficulties of computer vision -> pre-process of image, including some useful MATLAB functions such as rgb2gray, im2double, etc. -> filtering, including different types of noise and filter techniques, correlation/averaging/Gaussian/sharpening filter -> seam carving (first introduction of dynamic programming) -> edge detection/gradient, feature detection -> fitting, grouping detected lines into models/ least squares, RANSAC -> image matching, local feature extraction -> camera system, homogeneous coordinates and mapping between dimensions (single view geometry, calibration matrix) -> epipolar geometry, two camera system/ eight-point algorithm -> stereo, including correspondence matching algorithms, rectification, introduction of structured light (second introduction of dynamic programming in scanline stereo) -> object recognition/detection, including face detection and image classification, BOF model, SPM model, SVM, cascaded classifier, rectangle filters (integral image).
The detailed contents listed above could be concluded into four general blocks: filtering, edge detection and gradient related topics, stereo and view related topics, and feature recognition/detection. The rest of this article would unfold each topic and describe the concepts in detail.
1. Image filtering
Filtering is the most underestimated important phase in image processing. It is necessary because in many cases, images need to be pre-processed in order to complete further complicated operations. For example, to achieve efficiency by reducing useless information contained in the pixels would require a smoothing filter. Filters can also modify the image to achieve certain visual effects, especially useful in noise reduction. To start with, we assume the image consists of a finite number of pixels, which are stored in a matrix, or layers of matrices, and the visual-related information is represented as the pixel values, such as brightness or colors.
The most commonly used type of filter is the linear filter. This is a very broad class, and by name it performs linear manipulations on pixels. For a specific pixel, it applies a mask to a certain range, usually n x n, centered on that pixel. The values of this n x n masked neighbors are taken as inputs for the filter. The output from the filter then becomes the value of the centering pixel.
Such operation is repeated for each single pixel in the image. When the filter mask encounters boundary lines and corners, optional choices are applicable: it can assume pixels outside boundary with value 0, or pretend those are the reflection of inside pixels, etc. The choice depends on practical situations.
The focus is the manipulation of neighboring pixel values. A simple filter would be a running average filter—which changes the centering pixel value to the average of all neighboring pixels. Not surprisingly this produces a smoothing effect. A more sophisticated averaging filter would apply different weights to neighboring pixels based on their distance from the center.
The most famous and widely used filter would be the Gaussian filter. It is most widely used because Gaussian noise is the most prevalent noise in images, and Gaussian filter could effectively reduce this type of noise by smoothing the image. It basically acts as a low pass filter that takes a standard deviation value as input to convolute the masked pixels. There are also other types of noise occurring in image: salt and pepper noise, impulse noise. One can remove them by applying the median filter, which prevents drastic pixel value change. The Gaussian and averaging filters smooth the image. On the contrary of the smoothing filter, there is also a sharpening filter. It achieves the sharpening effect by highlighting the centered pixel and subtracting average value of its neighboring pixels. This filter is often used as a restoration of blurry-processed images.
Another different type of filter is difference filter. It calculates the derivative of pixels in x or y directions of the image, which is in turn related to the gradient of the image. Famous ones such as Prewitt, Sobel, Roberts are widely used in calculating gradient components, which is related to the next topic, edge detection.
Side topic: Seam carving
A technique used to smartly crop images, removing a large patch of repetitive texture while keeping useful objects. The algorithm could be described in following steps:
1. Calculate gradient magnitude for each single pixel. This is used as the cost of pixels.
2. Starting from the top, use dynamic programming to find a seam path with the lowest cost. The cost of a pixel is the cost of itself plus the minimum cost of the three neighbors on top of it.
3. Remove the low-cost seam paths.
4. By using DP method, this algorithm is efficient in speed.
2. Edge and gradient related topics
Edge detection:
An edge in the image is represented as the boundary of an object. It can be distinguished from another context by recognizing a sudden change (discontinuity) in the pixel intensities. The Canny Edge detector is the most widely used edge detection method. It can be implemented in the following steps:
1. pre-process the image with a smoothing filter to reduce noise, removing edge-irrelevant discontinuities.
2. Apply difference filters on both x and y directions to calculate derivatives, and combining them to get the gradient magnitude and angle of the image.
3. Normalize the angles to a set of sections. E.g., using 45 degrees as a section.
4. Apply non-maximum suppression to the image gradient: to the norm direction across the edge, only one pixel that has the maximum gradient magnitude out of three neighbors is kept as the edge pixel.
5. Eliminate furious edge curves by applying hysteresis thresholding: use a high threshold to start the edge curves, and a low threshold to continue them.
Fitting:
The edge detection is the first major leap in image processing. With the edge, one can build up a model that is similar to the human perception of an image. To achieve this, we need to use a fitting method to construct the contours above the edges and make simple models that represent daily objects. The goal of fitting is to find the optimal line parameter that connects the detected feature points and omitting the outliers. Two fitting methods are emphasized during the lectures: least squares and RANSAC.
Total Least Squares (TLS) is a rotation-invariant version among multiple versions of the least square method. If the line is represented as ax + by = c, the algorithm tries to minimize the sum of the normal distance from given feature points. By finding the roots of the derivative of the summation equation, the parameters can be calculated from the generated moment matrices.
Although invariant to rotation, however, TLS suffers from outliers in the data. Random Sample Consensus (RANSAC) is used as a robust algorithm against outliers. It can be described by following steps:
1. Randomly select a minimal set of points from the data points.
2. Hypothesize a model, and compute the error function.
3. Select points that fit the model, usually by applying a threshold.
4. Repeat step 1-3 until reached number of iterations.
This method requires tuning with plenty of parameters.
Feature detection/extraction:
The goal and reason for feature detection are to find elements present in the image that are invariant to both geometric and photometric transformations, such that occurrence of certain features in multiple images can be recognized, in order to identify the relation between such set of images or objects. The criterion to pick out feature is “unusual” and “unique”, which ideally maximizes its specialty within the image. Usually, the best case is the “corner” patches, where pixel values change extensively across all directions. Bad examples would be flat regions that are textureless or any segments that have no obvious discontinuities in one direction. This criterion metric could be expressed by a shift error between the patches: the SSD (Sum of Squared Differences) between the original feature patch and the shifted patch. By applying small motion assumption, one can deduce the formula to maximize the error for small shifts in all directions. This is achieved by calculating the Harris operator. More sophisticated feature extraction methods include Multi-Scale Oriented Patches and Scale Invariant Feature Transform (SIFT), which will be used in the further implementation.
3. Geometry related topics
Pinhole Camera Model
The pinhole camera model is a simple and effective single-view model to describe the sampling process of a camera system. For mathematical simplicity, the image sensor is placed behind the aperture on the focal axis. Since this is a single-view system, it reduces the real world dimension from three to two onto a plane, as well as losing information about angle and depth. Many points along the same visual ray map to the same point from the perspective of the camera. The origin of the camera coordinate frame is at the principal point, where the focal axis intersects with the image plane. The homogeneous coordinate system is introduced and perfectly solves the non-linear issue due to focal length. A calibration matrix based on the focal length and principal point is combined with the pixel magnitude on x- and y-axis to convert the real world coordinate frame to the camera frame.
A single-view system, however, as mentioned above, is not enough to represent the vivid content in a 3d world. Thus the multi-view system is introduced, and we mainly focus on the two-view system, with epipolar geometry as an example. The setup for epipolar geometry consists of baseline, epipolar plane and epipoles (intersection of baseline with image planes), and epipolar lines. The epipolar geometry imposes the problem of matching points between two epipolar planes. The constraint is that potential matches of certain point x from the left plane have to lie on the corresponding epipolar line of the right plane, vice versa. The concept of fundamental matrix relates points from left/right plane to the other image plane. The eight-point algorithm is introduced to calculate the matrix.
The most significant topic of geometry would be stereo correspondence matching algorithm. The algorithm takes two images taken by left and right cameras as input, and uses the information contained in the distance differences between the images to calculate more information in the 3d world, basically generating a depth map. First, the algorithm finds the distance between matching pixels from left and right images. The actual depth would require extra knowledge about camera parameters, by using similar triangles and other geometric manipulations on calculated distance. For most cases the relative depth is calculated, simply taking the absolute value of the pixel distance. There are extensive researches and algorithms on this topic. The images should be pre-processed and aligned with epipolar lines first and should satisfy three non-local constraints: ordering, uniqueness, and smoothness. The classic method is finding matching patches with minimum SSD on each scan line. It is not very efficient and cannot correctly find matches for textureless regions. A slightly better modification to this algorithm would be removing occlusion and erratic values, and replacing those pixel values by applying some neighboring analysis, such as averaging. Another more advanced algorithm would be applying dynamic programming. This is the main method implemented in this course. It could be described as following steps:
1. For each scan line, initialize a cost matrix and three tracing-back matrices; determine the occlusion cost value based on the average pixel intensity of the image.
2. Recursively define the matching cost for each pixel on the scan line. Use SSD for patches matching cost function.
3. For three cases, left occlusion, right occlusion, and matching, store the path information into the tracing-back matrices.
4. After the cost computation is finished, retrieve the calculated path and compute the difference between matching pixels as depth.
4. Object recognition/detection
The history of object recognition is briefly introduced. The difficulty of object recognition is that there exist 104 order of magnitude different visual objects; to correctly classify this number of objects would require essentially enormous computational resources. This is the recognition side of the problem. On the detection side, a naive approach is applying sliding window. This is also proving hard due to the pixel numbers on a single image. The false positive rate should be decreased to an extremely low level as well. This requires the usage of cascaded classifier, which will be described later.
Bag-of-Features Model
The algorithm of BOF could be described by following steps:
1. Collect a large data set of different objects to be classified.
2. Extract features from the training images for each class and put them into a feature pool. Features can be SIFT or SURF features.
3. Run clustering algorithm to define the “visual words” in feature vector space.
4. Generate a histogram for each class, which contains the percentage of the features appearing in that class relative to the general feature pool.
This algorithm is comparable to the text classification method used in natural language processing and sometimes is also known as “Bag-of-words” algorithm. The disadvantage of this algorithm is that it requires a long time to run for bigger scale classes, and the average recognition rate for big scale classes is not very high.
The method of clustering is also introduced in detail. The method we use here is k-means clustering. The goal is to minimize the sum of squared Euclidean distances between each feature descriptor point and its nearest cluster centers. The algorithm runs as below:
1. Randomly select/initialize K cluster centers.
2. Iterate until convergence: assign each data point to the nearest center; recompute each cluster center as the mean of all points assigned to it. Stop when cost decreases is not significant.
BOF can also be used for action recognition, which simply extends image frame to video, with a higher dimension codebook of features.
Spatial Pyramid Matching algorithm is an extension of BOF model. It divides the image into different section levels with locally orderless features, and compare testing image with training sets using support vector machine (SVM).
The nature of SVM is to find a hyperplane that maximizes the margin between positive and negative data sets. Those data points closest to the margin are so-called support vectors. This can be transformed into an optimization problem. It can also be extended to a nonlinear case by some manipulations, such as coordinate change, or map to a higher dimension space.
Now let us come back to the face detection problem, which requires a false alarm rate less than 10-6. One approach is to apply a cascaded classifier. It contains K levels of separate classifiers, and each level runs fast enough, with a not very accurate result. However, after cascading them it produces a satisfying result, in a reasonable amount of time. Another approach would be decreasing computational complexity by reducing the number of multiplication and addition. This can be achieved by integral image, also known as rectangle filter. Every pixel value (i,j) of an integral image is the sum of the pixel value of the rectangle bounded in its top-left region. It only requires three additions for an arbitrary region and greatly reduces the computation complexity of sliding window method.
I would like to spend some discussion on the topic of object recognition/detection. This topic is composed by several sub-topics, including the history of recognition ideas, difficulties and some approaches to solving the problem.
Here is a brief description about the terminologies later will be used in introducing historical ideas. Alignment is the matching between distorted/transformed image pairs. The transformation could happen in many ways, such as rotation of the object, or different view angle on the object. Alignment is to apply some geometric tricks on the images to reorganize them into a state that is invariant to such distortion. In-class variation refers to the difference between the appearance of objects inside a single class. For example, there are all different kinds of books, with different colors, covers, and sizes, but they are all characterized as a book. Under certain cases, this variation is so huge that literally there is no way to recognize them unless they are incorporated into the original database. The geometric property of an object is a set of parameters that are related to its size and shape, geometrically; the photometric property is the color, lighting and intensity information of the object in an image. A patch is simply a block of pixels in the image. A feature descriptor is a vector that contains values with information describing the property of a certain pixel or patch in an image.
The original thoughts view recognition as an alignment problem. The shape of the object is assumed known, and the goal of the algorithm is to eliminate the effect caused by camera position or illumination. That is, for the same object, the image taken by a camera at the front at noon and one taken by a camera at a side at night should produce same recognition results, given geometric properties of that object. This is a very rudimentary recognition algorithm because the application is such limited that it cannot even deal with in-class variation--two different shaped chairs would be treated as different objects because there is no way to model a geometric, or photometric transformation between them if they have different shapes.
In late 80’s, a new method to recognize objects by their components came out. Instead of recognizing the object as a whole, it would recognize some primitive geometric shapes at first, and then combine these shapes to produce a prediction. Later come appearance-based techniques. These methods try to recognize/classify objects based on their intrinsic properties, such as color. The algorithm takes the color information stored in pixel values and puts them into a high dimensional vector space, which forms a specific color pattern for that object. The same procedure is repeated on other images to compare and predict objects.
In mid 90’s, sliding window emerges as an approach for object detection. Given an image, a window with certain size n x n would slide from the top left corner of the image to the right, then start from the left of the next pixel line and repeat until it reaches the bottom right corner. The window may vary in the reasonable size range for detection because it does not know the actual size of possible objects in the image. Each windowed patch is examined by comparing itself with the sample patches and can be determined whether it is the object based on the error value. The major constraint for this approach is the number of windows or patches used during detection. Since the varying-sized window slides across the entire image, a normal picture would require over 108 order of magnitude windows to do a full scan. This severely restricts sliding window as a detection algorithm due to its high computation complexity, as well as the incredibly low demand on a false positive rate. Some methods are devised to improve the algorithm, such as cascaded classifier and integral image described above.
Later on, people starts to focus on extracting features from objects within the image. The most intuitive thought would be recognizing the shape of a certain object. This is applicable because the edge detection method was developed then, and the contour of objects could be easily determined with the help of fitting algorithms (fitting: with a given set of data points, find a line or curve that best represents their geometric distribution). The deduced contour for certain class can be directly applied to input images for object recognition, but additional feature patches are also used to increase the recognition accuracy. Generally, the candidate matching is
determined based on the similarity between feature descriptors extracted from input images and those from database pool.
In early 00’s people tend to turn back to late 80’s approach. A more advanced version with “parts-and-shape” models came up. This model is made of three parts: a database containing small and simple objects that are easy to recognize/detect; the information about the relation between those object parts, mainly geometry, and positions; the general appearance of different parts as a whole. This is very similar to the “recognition by components” approach.
The modern object recognition/image classification algorithm is derived from a text-processing algorithm, called “Bag of Words” (BOW). Imagine a bunch of books with different topics, and the goal is to train the computer so that given it a new book as input, it can classify the new book as about certain topic. The algorithm does not relate to complex probability methods or fancy terms such as Bayesian, or Latent Dirichlet Allocation; it simply takes account of all possible word occurrences. First, the algorithm extracts all words from the set of books and then cluster them to produce a vocabulary with fixed number of words that represent each different topic. And for books of each topic, it calculates the word frequencies of the vocabulary. When a new book is given, its word frequency is also calculated and compared with that of each topic. It would then be classified as the topic with most similar word frequency. The BOF method is exactly the same except words are replaced by “visual words”, which are basically feature descriptors, and the dictionary is replaced by a “codebook” which contains generalized descriptors.
Next is a brief walk through of BOF (Bag of Features) image classification algorithm.
To begin the description of BOF model would require an introduction and walk-through of computer vision terminologies. As all other image classification/object detection algorithms do, BOF also requires a training data set to setup the database. The training data is the set of images that is used to teach the algorithm about each image class. An image class consists of an appropriate number of images that share the same feature/object. For example, a set of images of different breeds of dogs would make the dog class. The number of training images is also an important factor for classification accuracy--too few images would result in the lack of features for prediction, and too many images would result in high computation complexity. To examine the accuracy of certain recognition method for further usage, the training and testing data set is usually split by ratio 3:7 within the given data pool.
Given a set of image classes, with the goal of learning their features, an intuitive approach is to keep a record of the specialty for each class. In plain language, this is analogous to how human perceive, or classify scenes--schools all have buildings with many windows, restaurants have lots of tables and chairs, the beach has sand and coconut trees… Whenever we see similar scenes in daily life, we tend to recognize them as corresponding scenes we have seen before. This is exactly what our computer tries to do. The first problem is how to pick out the relatively important and useful features from images. A good guess would be featuring pixels or patches that only possessed by one particular class. These feature descriptors have to satisfy the “uniqueness” constraint. There are two types of uniqueness: local and global. A good feature descriptor satisfies both local and global uniqueness, as well as invariant to any kind of distortion, both geometrically and photometrically. A common way to describe this invariance is by listing a bunch of distortion factors--the feature descriptors should be robust enough that the light on the object or the direction of the camera when taking the picture can produce no effect on them. As for the uniqueness, pixels on corners are good choices--at least they are locally unique. A “corner” can be thought as the intersection of edges, such as one right angle of a table, or the collar of a shirt. The detected features are normalized, in a sense that discards information about lighting and orientation, then transformed into a vector by some encoding method.
The extracted feature descriptors from all training images are stored in a giant feature pool for the next step, clustering. The descriptor vectors can be imagined as a set of points floating in a high dimensional space (128 for SIFT, 20 for PCA (principle components analysis) reduced, and 64 for SURF). There are too many data points for further process and computation; for a task to recognize 20 image classes, there would be orders of 106 points in space. Many feature points are also similar, in the sense that they are very close spatially. Clustering is one way to reduce unnecessary data points while keeping accuracy. It gathers similar data points together and merges them into one cluster center, which represents all of its closest neighbors. The method used in this course is k-means clustering, where k stands for the number of clusters chosen, and mean represents the method to update cluster centers--averaging values of all points represented by that cluster center. The algorithm first selects randomly k cluster centers among millions of data points in space. Then it enters a giant loop. Each of the rest of points in space is assigned to the nearest center selected in the first step. “Nearest” means literally closest--the distance metric here is Euclidean, which is just the intuitive spatial distance. “Assign” means classify the point as the same set of the center; to help understand this, you could imagine the points are marked by different colors, where each color represents a cluster. The cluster center is the original owner of that color. But the cluster centers are randomly selected in the first place, and this would not produce a good result. The coordinate or spatial location of these cluster centers are then recomputed by taking the mean of all points assigned to it. This whole process after a random selection is repeated until the location change of cluster centers falls below a threshold. This means the cluster centers are accurate and representative enough that no further averaging is necessary. The cluster centers now represent the standardized “visual word” pool in the codebook. That is, any feature descriptor from the training set can be viewed as one of the cluster center (its closest neighbor), and this provides convenience for later comparison, because unlike words, features in images are not discrete and absolute. The clustering discretizes the feature descriptors and makes it possible to match and compare these “visual words” from different images.
The “visual word” frequency is then generated for each image class. Each image class contains feature descriptors from all images within that class, and they are all classified by nearest cluster centers. For example, an input set of random vectors from -1 to 10 comparing with cluster centers [1,2,3] would result in a histogram counting number of occurrence of [1,2,3] in the input set. Those values that are not [1,2,3] in the input set are still counted as [1,2,3] by finding closest neighbors--4 - 10 all count as 3, while -1 and 0 are counts as 1. This example illustrates the notion of “standardization”.
With the generated histograms of features for each image class, the training phase is finished; the visual word frequency is calculated for each class, as word frequency for each topic is calculated. Next is the testing phase. The testing has to strictly follow the parameters or metric choices in training phase because this is a strict regularization on any train-test pairs. In this way, the input image is taken, with feature descriptors extracted by the same method used before (SIFT, SURF). The descriptor points are then compared and matched with cluster centers, just as training class points. The generated histogram count of matching frequencies for each cluster center is the feature representation of that image. This histogram is then compared with histograms of each class generated during the training phase, and the class with most similar one is the predicted class for the image. The similarity between histograms can be evaluated by metrics on users own choice--both correlation and Euclidean works well.
One apparent and a common concern is that the cluster centers are generated from a much bigger number of images than the testing image--the number of feature descriptors from one single image is only on the order of magnitude 102. When comparing the descriptors with the class histograms, there could be thousands of missing bins. Surprisingly this is not a problem. For a specific image closely related to one specific class, or even multiple classes, the descriptors about one certain class would take the majority of occurrences, while the rest would be trivial, mainly errors. The similarity between the histograms can be determined easily by the major components, not the trivial ones.
The BOF method suffers from bias. One is biased from a number of descriptors. Certain image class may contain extremely rich and vivid patterns than other ones. One example is butterfly and refrigerator; the wings of butterfly contain an enormous variety of patterns, while the surface of the refrigerator is almost textureless, plain white or black. The feature descriptors extracted from butterfly class can be 103 times greater than refrigerator class. This could result in severely unbalanced clustering. One method is to impose a limit on a number of feature descriptors for each class, suppressing the active ones. Another source of error is covariance--where two objects are in nature difficult to tell apart, such as butterfly and bat; they sometimes deceive even human eyes. Computers are harder to do better than human in this recognition case, and the only thing to lower the error rate, in this case, is to reduce the feature descriptor dimension.
There are many extension of BOF algorithm, and the main modification is about feature extraction and clustering part. Hopefully, this brief summary above walks you through the basics of modern object recognition/image classification.