|Figure: Head of a robot endowed with vision.|

I'm a lay artificial intelligence researcher. I occasionally get interested in machine vision. At various times, I've searched online for information about the subject, but finding good material can be difficult. There are quite a few sites that have links to other sites. And there are conferences, commercial products, and other such things that can be found. But it's hard to find much that brings it all together.
This page sets out to bring some of these resources together. I don't have the time to create a truly exhaustive resource. My hope, then, is to make this a decent starting point for others looking for background information. To that end, I'm organizing information by category and trying to provide summaries, speculations, and other opinions.
One other goal I have for this page is to demystify machine vision. The popular press has a habit of making the products of machine vision research look far more impressive than they often actually are. Even someone like me with a little knowledge of how a lot of the techniques work can easily be fooled by a new trick into thinking the field is much farther along than it really is. And for various reasons, the web sites of research projects or commercial products don't often reveal much about the techniques used. Admittedly, some of my explanations are speculations based on what I find online and sometimes just reckoned by asking myself, "how would I do that?"
I invite you to let me know of your own work. I especially welcome information about current and historically significant research projects, but I also welcome information from the private sector about new or significant products under development or in use today. And feel free to let me know if you find any of my explanations is inaccurate or incomplete. Send me email about your projects, products, and thoughts.
What is Machine Vision?
|Figure: Facial measures used in a biometrics vision system.|
"Machine vision" is a field of study and technology whose goal is to endow machines with the ability to perceive selective aspects of the world using visual means.
I apologize if this sounds like a circular definition. One can easily get lost in a particular concept or technology when trying to define machine vision. Perhaps it's best to start with an ostensive definition, then: one that points at examples. Those of us fortunate enough to have functional eyes have an incredible ability to perceive and understand the world through them. Engineers have long sought to endow machines with this same capability. It's easy to assume that this just means duplicating human mechanisms in machines, but that's not all there is to it. Some techniques involve projecting and reflecting laser beams off distant targets, for example, which is very different from how you and I work. Some systems can read and understand information in bar codes or other special constructs that are difficult for humans to deal with.
Most importantly, few techniques being researched or in use today really resemble the awesome complexity and flexibility available to humans. We MV researchers have our own bag of tricks. It may be that some day we bring all those tricks together and find we can make machines "see" as well as or even better than humans do, but we're nowhere near there yet.
All practical machine vision systems in use today exist for their own specific purposes. Some are used to ensure that parts coming off assembly lines are manufactured correctly. Some are used to detect the lines in a road for the benefit of cars that drive themselves. Though some interested parties claim otherwise, there are no general purpose vision systems, either in laboratories or on the market.
If it sounds like it's difficult to define machine vision, don't fret. The point is that the field of machine vision is not simply interested in duplicating human vision. What is essential is the basic goal of visual perception: the ability to "understand" the world visually well enough to move about in and interact with a complex, ever-changing environment, and to discern the information in that environment essential to the core goals of whatever is doing the seeing.
As mentioned above, all practical machine vision end products available now are for specific purposes. I contrasted that with general purpose machine vision. Let me define what that means, then.
I'll start, again, with an ostensive model. Human vision is general purpose. In our everyday experiences, we see a rich panoply of things in all sorts of lighting conditions. We are able to operate well in almost any circumstance in which there's even a modest amount of light entering our eyes and which isn't damaging them.
Merely being able to see light is nearly useless, though. The best video cameras today are still just recording or transmission devices; they don't do anything else practical with it. By contrast, we are poor recording and transmission devices. It's our faculties for visual perception that distinguish us. So let's talk about what we do with the visual information we can see.
We can recognize the boundaries between objects. We can recognize objects. We can recognize the repetitions that compose both simple and rich textures. We can intuit the nature and location of light sources without seeing them directly. We can recognize the three dimensional nature of the things we see. We can see how things are connected together and how larger objects are subdivided into smaller ones. We can recognize that the two halves of a car on either side of a telephone pole are actually parts of a single car that is behind the pole. We can tell how far away things are. We can detect the motion of objects we see. We can recognize complex mechanisms with lots of moving parts as components of single larger objects and distinguish them from the backdrop of the rest of the world. We can even recognize a silver pitcher amidst a noisy background as a thing unto itself, even though we only see the reflections of that background.
Perhaps the most interesting feature of human vision that distinguishes it from most machine vision techniques crafted to date is that we can deal very well with novel situations. A new car you've never seen before is still obviously a car because it looks like a car. You instantly catalog novel objects and register essential differences.
How does one distill all this down into a clear definition, then? What is general purpose machine vision? I think it's best to define it in terms of a set of core goals. A machine can be said to have general purpose machine vision if it can:
- Construct a 3D model of the open space within its visual field sufficient for movement within that space and interaction with the objects within it
- Distinguish most any whole object, especially a complex moving one, from the rest of a visual field
- Recognize arbitrarily complex textures as continuous surfaces and objects
- Have a hierarchic way of characterizing all the objects within a scene and their relative positional and connectivity relationships to one another
- Characterize a novel object using a three dimensional animated model composed of simpler primitives and be able to recognize that object in most any orientation
- Be able to recognize and separate objects in a wide variety of lighting conditions, including complex arrangements of shadows
- Be able to separate and recognize objects that are transparent, translucent, or reflective, given sufficient visual cues
There are probably other milestones one could add to this, but it seems a pretty lofty set for now.
|Figure: A photoresistor for detecting light.|
The most basic kind of sensor that can be used for vision is one that sees only a single "pixel". A photoelectric cell -- in this case, a photoresistor -- like the one in the figure at right is an example. When you think of pixels, you probably think of very small parts of a larger picture, but I don't necessarily mean it in that sense. An entire picture can be composed of a single pixel. What matters here is the field of view of the imaging sensor. For many photoelectric cells, the field of view might include up to half the full sphere of the view around it. Narrowing the field of view of a photocell like this is a simple matter. One could, for example, put a small box over the cell and drill a small hole in it so that only light coming from the direction of that hole can reach the sensor.
In keeping with the idea that an electronic eye need not be limited to working like our eyes, let's consider some other kinds of spot-source sensors. One technique involves a speaker outputting a very high frequency tone and a microphone picking it up. The closer or larger a nearby object is, the more sound it will reflect and hence the stronger the signal reaching the microphone will be. A laser beam and a photoelectric cell can serve a similar purpose. In addition to sensing differences in intensity, they can also be used to measure how long the signal takes to get from emitter to detector and thus determine the distance to one or more objects.
|Figure: A charge-coupled device (CCD) used in digital cameras.|
Most digital cameras use the same basic approach to imaging. At their heart is a device, called a charge-coupled device or "CCD", that serves the same purpose as a piece of film. A set of one or more lenses focuses light onto the CCD, which is made up of a rectangular grid of individual light sensors similar to the photoelectric cell figured in the previous section. The figure at right shows an example of a CCD that has a grid of 1,024 sensors across by 1,280 sensors down and is used in medical X-ray imaging devices. A digital camera outputs information in a form that can easily be interpreted by a computer as a grid of light levels in one or more discrete electromagnetic wave bands (e.g., red, green, and blue, or X-ray and infrared frequencies).
Technically, a computer can use an analog camera as its input, but in any case, a digital computer ultimately must use digital information. The continuous stream of signals from an analog camera, then, must be converted into a stream of digital information that can be interpreted in the same way as a digital camera's output.
|Figure: A laser imager scanning a statue.|
Some machine vision systems use lasers to directly sense the three dimensional shapes of their immediate surroundings.
The basic idea behind this is to exploit the fact that light travels at a known and thus predictable velocity. A laser pulse is sent out in some direction and may be detected by a sensor like the photoelectric cell described earlier if it hits an object that is relatively nearby or highly prone to reflect light back in the direction it came from. Using a very fast clock, the electronics that coordinate the laser and light sensor measure how long it took for the light pulse to be detected and hence calculate how far away the reflective surface is. Because a laser beam can be made very fine-pointed, it is generally reasonable to assume that it will hit only a single surface and so only a single response will come back to the sensor.
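The round-trip arithmetic is simple enough to sketch in code. This is only an illustration of the timing principle; the function name and the sample pulse time are inventions, not drawn from any particular device:

```python
SPEED_OF_LIGHT = 299_792_458.0  # meters per second, in vacuum (close enough in air)

def distance_from_round_trip(elapsed_seconds):
    """Convert a measured round-trip pulse time into a one-way distance.

    The pulse travels out to the surface and back, so the one-way
    distance is half the total path length.
    """
    return SPEED_OF_LIGHT * elapsed_seconds / 2.0

# A pulse that returns after roughly 66.7 nanoseconds hit a surface
# about 10 meters away.
print(round(distance_from_round_trip(66.7e-9), 2))
```

The hard part in practice is not this division; it's building a clock fast enough to resolve nanosecond-scale intervals.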
By gradually pointing the laser at different places -- usually within a rectangular grid pattern -- in the system's field of view, sending pulses of light at each, and taking measurements of the time each pulse takes to be reflected, one can gradually build an image. The image formed is not like the one you are used to. Instead of representing levels of light, each pixel in such an image represents a distance to the surface that the laser hit when it was aimed in that direction.
|Figure: Output from a ground-penetrating radar.|
One particularly interesting idea that has found many expressions is the idea of using echoes to detect objects that are not otherwise visible.
The laser scanners described above use echoes too, but they rely on the object being detected being fairly solid, and on the space between the emitter and the subject being fairly empty relative to the much denser subject.
Imaging objects underground is a great example of a case where the goal is to "see" objects amid surroundings whose densities differ far less than, say, air and rock do. One key technique is to project some wave of energy -- perhaps sound waves or microwave energy -- down into the ground and detect the energy that is reflected off of layers and objects in the ground. Because those layers and objects do have somewhat different densities or other properties that affect the projected energy, they reflect it to varying degrees. As each pulse of energy is sent out, the detector continually measures the energy coming back over time. The intensity of the returning signal, plotted against the time elapsed since the pulse was sent, forms a linear gradient: a one-dimensional trace of the reflections below that point. By moving the device along a straight path on the ground and sending pulses at each point, a two dimensional image can be composed, with each trace forming one vertical column of pixels and each position along the ground determining that column's horizontal position.
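A sketch of the image-assembly step, assuming the per-pulse traces have already been captured (the trace values below are invented for illustration):

```python
def assemble_scan_image(traces):
    """Turn a list of per-pulse traces into a 2D image (rows x columns).

    Each trace is a list of reflected-energy readings over time; it
    becomes one vertical column of pixels.  Row 0 is the earliest
    (shallowest) reading.  All traces must be the same length.
    """
    depth = len(traces[0])
    # Row r of the image gathers reading r from every trace.
    return [[trace[r] for trace in traces] for r in range(depth)]

# Three pulses taken along a line, four time samples each (invented values).
traces = [[0, 5, 1, 0], [0, 6, 2, 0], [0, 4, 9, 0]]
image = assemble_scan_image(traces)
print(image[1])  # the second time sample across all three positions: [5, 6, 4]
```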
The output of most ground-penetrating radar systems, like the one in the figure at right, could not look more alien to our own sense of how vision works, even when rendered as a 2D image like the figure. But it's important to recognize that there is real information in a visual system like this. With training, anyone can learn to recognize the significance of what's in such images. And so can a machine. And while we don't deal with such data easily, one could also take many slices along a grid drawn on the ground and stack them like pages in a book to form a three dimensional picture. To be useful to a human, it would probably be necessary to delete some of the resulting three-dimensional pixels -- also known as "voxels" -- so the remaining parts can be seen as "solid" objects.
One fascinating extension of this same concept is to generate an image of what's below the ground using sound. In one arrangement, two or more microphones are placed on the ground around an area to be imaged. A person with a sledgehammer moves from point to point on a grid and strikes the ground. A computer records the echoes and the times at which they arrive. Again, it may be that more than one pulse is heard by a given microphone, because different objects underground may reflect sound in different ways. Sound waves may even separate and take different pathways to a given microphone. The end result, again, is either an image representing a sort of 2D slice through the ground or a 3D image representing a volume of ground.
Related concepts are also at work in familiar medical imaging technologies like MRI and PET scanners, not to mention the ubiquitous ultrasound equipment, which is echo-based imaging in its most direct form.
Primitive Visual Features
It's natural to want to dive right into the high level techniques and goals of machine vision, but it's important to first understand some of the lower level features used to characterize images. Most higher level vision approaches are built on particular solutions to the problems of recognizing these primitive features and building larger structures from them.
One of the oldest concepts in machine vision, edge detection is also one of the most enduring. The essence of the technique is to scan an image, pixel by pixel, in search of strong contrasts. As each pixel is considered, the pixels around it are examined as well; the more variation there is, the more strongly that pixel is considered to be part of an "edge", presumably of some surface or object. Typically, the contrast sought is in brightness -- in effect, the black and white representation of an image -- but sometimes hue, multiple color channels, or other pixel-level features are considered.
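A minimal sketch of such a contrast scan, on a grayscale image stored as a list of rows, might look like the following. The neighborhood here is deliberately tiny (just the right and lower neighbors), and the threshold is an arbitrary choice:

```python
def edge_map(image, threshold):
    """Mark pixels whose brightness differs sharply from a neighbor.

    `image` is a list of rows of grayscale values (0-255).  A pixel is
    marked as an edge (1) when the contrast with its right or lower
    neighbor exceeds `threshold`; otherwise it is 0.
    """
    rows, cols = len(image), len(image[0])
    edges = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < rows and nx < cols:
                    if abs(image[y][x] - image[ny][nx]) > threshold:
                        edges[y][x] = 1
    return edges

# A dark square on a light background yields edges along its border.
image = [[200, 200, 200, 200],
         [200,  30,  30, 200],
         [200,  30,  30, 200],
         [200, 200, 200, 200]]
for row in edge_map(image, 50):
    print(row)
```

Real edge detectors use larger neighborhoods and weighted kernels, but the contrast-against-neighbors idea is the same.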
This pixel-level edge detection operation is so simple and common that it can be found in many ordinary paint programs. The figure below illustrates one use of edge enhancement in the popular Photoshop program:
|Figure: Using Photoshop to "detect" and enhance edges.|
The idea of doing contrast-based edge detection had a lot of momentum in the early days when scientists studying human vision determined that our own visual systems use this technique. Once replicated in machines, it seemed like we were just a short way off from having general purpose vision. But early successes in the ability to find edges at the pixel level did not quickly translate into successes in higher level vision goals. We'll explore this more in coming sections.
One of the challenges in translating edges based on contrast into edges of objects is that contrasts can be caused by factors other than the obvious. For instance, a "specular" reflection of light as off a shiny surface can cause the appearance of a sharp edge around the reflection. Similarly, a shadow cast upon a surface can create a strong contrast at the boundary between the shadowed and lighted portions of that surface. These artifacts tend to lead edge detection algorithms to get "false positive" results. Following is an illustration of how shadows can create false positives:
|Figure: The effects of partial shadows on edge detection with an image of leaves.|
On the other hand, "false negative" results can be caused by something as simple as a blurry edge. Consider the figure below:
|Figure: Unexpected results can come from edge detection with blurry edges.|
Note how the woman's nose is completely invisible to this edge detection approach because the edges we perceive there are actually very soft and subtle in terms of contrast. Other edges we infer, like the one at the top of her hair or on her left shoulder, are also missing because of weak contrasts. Meanwhile, the shine off her forehead and chin clearly creates contrasts strong enough to result in false positive edges.
The above figures also illustrate how easily edges we perceive as continuous get broken up in pixel-level edge detection algorithms. The messiness of having lots of neighboring and intersecting edges packed into small spaces also really complicates things.
Consider one of the central issues with simple edge-finding algorithms. We'll call it the "threshold problem". It can be expressed simply as: "How strong a contrast is strong enough for a place in an image to be considered an edge?" If one chooses a threshold that's too low, there will be too many edges for the result to be useful. If the threshold is too high, too few edges will be found. The following illustrates the problem:
|Figure: Edge detection using different thresholds: a.) source image; b.) high threshold; c.) medium threshold; d.) low threshold.|
The sad truth is that there is no "right" answer when it comes to choosing a threshold value. What most researchers don't want to admit is that they do not rely on automation to decide what threshold value to use. They choose a value based on the particular application, lighting conditions, and other finer details. This raises the classic AI problem of the "brain inside the brain". That is, it frequently takes an intelligent agent -- the researcher -- to determine a key factor in proper edge detection so it can be "automated". In some situations, as in a factory, the conditions can be controlled. General-purpose vision cannot assume such controlled conditions, though. Your eyes certainly don't.
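A quick sketch shows how sensitive the result is to that choice. Given an invented map of per-pixel contrast strengths, the number of "edge" pixels swings wildly as the threshold moves:

```python
def count_edges(contrast_map, threshold):
    """Count pixels whose local contrast value exceeds the threshold."""
    return sum(1 for row in contrast_map for value in row if value > threshold)

# An invented 4x4 map of per-pixel contrast strengths.
contrasts = [[ 5, 12, 80, 90],
             [ 7, 15, 85, 11],
             [60, 70,  9,  8],
             [65, 75, 10,  6]]

for threshold in (5, 50, 100):
    print(threshold, count_edges(contrasts, threshold))
```

With a threshold of 5, nearly every pixel counts as an edge; with 100, none do. Nothing in the image itself says which answer is "right".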
Despite the shortcomings, edge detection has found much expression in very practical industrial and research systems. The figure below illustrates a sample use of a simple sort of edge detection algorithm in inspection of a manufactured part:
|Figure: An inspection system detects that one of four expected cables is missing.|
In this case, a linear slice one pixel wide is taken where cables are expected to lie in the image. The number of sharp edges (six here) is counted up and divided by two edges per cable, revealing that there are only three of four expected cables. Linear slices like this can be used to spot-check object widths, rotational alignments, and other useful metrics that are helpful in inspection systems. And full-image scans are also used to detect edges in roads and other systems for use in more sophisticated applications.
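The slice-counting trick can be sketched as a one-dimensional version of the same contrast scan. The pixel values below are invented, and a real inspection system would have to be more careful about noise:

```python
def count_cables(slice_pixels, threshold):
    """Estimate how many cables a 1-pixel-wide slice crosses.

    Each cable produces two sharp edges (entering and leaving it), so
    the number of strong transitions divided by two is the cable count.
    """
    edges = sum(
        1
        for a, b in zip(slice_pixels, slice_pixels[1:])
        if abs(a - b) > threshold
    )
    return edges // 2

# Background is bright (~200); three cables show up as dark runs (~40).
slice_pixels = [200, 200, 40, 40, 200, 40, 40, 200, 200, 40, 200]
print(count_cables(slice_pixels, 100))  # six sharp edges -> 3 cables
```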
Regions and Flood-Fill
Finding the edges of objects may seem like the basis of finding objects in an image, but it's only the beginning. The edges found in a picture of a human ear, for example, will be far more complicated than the overall shape of the ear. Edges provide one means of finding objects, but finding regions within an image can be thought of as one step higher in abstraction.
One of the most basic means of finding regions in an image is to use a "flood-fill" algorithm. This term comes from the similarity of the algorithm to the basic flood-fill operation most paint programs have. To the program, it's as though a region is a flat plain into which color can be poured, but which has "sharp" edges beyond which the color won't spread. Those edges are usually defined in exactly the same way as we considered above with regards to edge detection.
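A minimal flood-fill of this kind might look like the following sketch. It grows a region outward from a seed pixel and stops wherever a pixel's value strays too far from the seed's; the tolerance value is an arbitrary assumption:

```python
from collections import deque

def flood_fill_region(image, seed, tolerance):
    """Return the set of (row, col) pixels connected to `seed` whose
    grayscale value stays within `tolerance` of the seed's value."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    region = {seed}
    frontier = deque([seed])
    while frontier:
        y, x = frontier.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < rows and 0 <= nx < cols
                    and (ny, nx) not in region
                    and abs(image[ny][nx] - seed_value) <= tolerance):
                region.add((ny, nx))
                frontier.append((ny, nx))
    return region

# A dark L-shaped region (value 10) against a bright background (200).
image = [[10, 10, 200],
         [10, 200, 200],
         [10, 10, 10]]
print(len(flood_fill_region(image, (0, 0), 30)))  # the six dark pixels
```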
It's helpful to use the paint program's flood-fill analogy because of its intuitive nature. The following figure shows a picture with some areas sectioned off using a flood-fill algorithm. Each distinct region found gets its own unique color.
|Figure: Using flood-fill to isolate major regions of an image.|
The limits of flood fill start becoming pretty obvious with the above image. For example, notice how the ocean is divided into two parts by the large rock? The left side of the ocean (blue) and the right (green) are obviously part of the same object, to you and me, but not to an algorithm simply seeking out unique regions using a basic flood-fill algorithm.
Another issue is that a flood-fill operation can "spill out" of one region to another. See how the white region includes the nearby rock, part of the cliff on the left side, most of the farther-off rock, the white foam where the ocean meets the beach, and so on? Few of us would assume that all of these separate objects are really part of the same object, yet the flood-fill algorithm doesn't see these distinctions.
One way in which a typical paint program's flood-fill algorithm differs from one used for machine vision is in how they deal with gradients. The following figure illustrates the distinction.
|Figure: Two ways to interpret a smooth gradient using a flood-fill algorithm.|
A typical paint program will take note of the color of the pixel you first clicked on and seek all contiguous pixels that are similar to that one. Hence the separate bands above in the middle image. A typical edge detection algorithm as described above would not find any edges within the gradient; only around the circle and square. A more appropriate flood-fill algorithm for machine vision, then, will fill smooth regions, even where the color subtly changes from pixel to pixel. The filling stops wherever there are harder edges.
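The distinction comes down to a single comparison: measure each candidate pixel against the adjacent pixel it is reached from, rather than against the original seed. A sketch, with invented pixel values:

```python
from collections import deque

def fill_smooth_region(image, seed, max_step):
    """Flood-fill that tolerates gradients: a pixel joins the region if
    it differs from the *adjacent* region pixel by at most `max_step`,
    even when it has drifted far from the seed's original value."""
    rows, cols = len(image), len(image[0])
    region = {seed}
    frontier = deque([seed])
    while frontier:
        y, x = frontier.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < rows and 0 <= nx < cols
                    and (ny, nx) not in region
                    and abs(image[ny][nx] - image[y][x]) <= max_step):
                region.add((ny, nx))
                frontier.append((ny, nx))
    return region

# One row sweeping smoothly from 0 to 100, then a hard jump to 250.
row = [[0, 20, 40, 60, 80, 100, 250]]
print(len(fill_smooth_region(row, (0, 0), 25)))  # crosses the gradient, stops at the jump
```

A seed-based fill with the same tolerance would have stopped after the first couple of pixels; this one crosses the whole gradient and halts only at the hard edge.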
One of the more interesting primitive features that can be dealt with in machine vision is repeating and quasi-repeating patterns. Strong textures can confound simple object detection because they can trigger edge detection and halt flood-fill operations. Following are some images that include strong textures that can easily foil such simple operations.
|Figure: Samples of textures, such as grass, bricks, leopard spots, marble, and water.|
Texture recognition is such a challenge to deal with in large part because it's difficult even to define the concept of textures formally. Even dictionaries don't seem to do it much justice. Here are some examples:
- The characteristic appearance of a surface having a tactile quality
- The tactile quality of a surface or the representation or invention of the appearance of such a surface
- In a photographic image, the frequency of change and arrangement of tones
What the above sample images illustrate, though, is how obvious the notion of texture seems to our visual systems, even if it's difficult to formally define.
One characteristic that seems somewhat consistent about textures is what can be called a "color scheme". In the first image, the grass is heavy in the greens and blacks. The bricks are heavy in reds and blacks. The water is heavy in blues.
How can we use this in automation? Here's a simple illustration. Imagine taking a sampling of many or all of the colors in a patch of some texture of interest. We'll call that collection of colors the color scheme. Now for each color, we find all pixels in the source image that have that same color and add them to a total selection. Following is an illustration of this using the above images as sources:
|Figure: The same images above with some textures selected based purely on color schemes.|
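A sketch of the color-scheme idea, using stand-in color labels rather than real RGB values (a real implementation would need some tolerance when matching colors, rather than exact equality):

```python
def color_scheme(patch):
    """Collect the set of distinct colors found in a sample patch."""
    return {color for row in patch for color in row}

def select_by_scheme(image, scheme):
    """Mark (1) every pixel whose color appears in the scheme."""
    return [[1 if color in scheme else 0 for color in row] for row in image]

# Tiny invented image: 'G1'/'G2' pixels stand in for grass colors, 'S' for sky.
patch = [["G1", "G2"],
         ["G2", "G1"]]          # a small sample taken from the grass
image = [["S",  "S",  "G1"],
         ["S",  "G2", "G1"],
         ["G1", "G2", "G2"]]
print(select_by_scheme(image, color_scheme(patch)))
```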
It should be fairly apparent that in most of the cases above, the color scheme-based selections seem very strongly biased towards highlighting just the textures of interest. The leopard one seems a poor example, to be sure. That seems to be because the black spots themselves are very similar in color to the black in the tree branches, leaves, and so forth. Whatever its power, this neat trick is surely not sufficient for recognizing textures.
What we could do with the selections made, then, is start by removing the "noise" pixels. That is, we can find places in an image -- as with the grass -- where there are small, stray islands of unselected pixels and simply add them to the selection. Likewise, we can find stray islands of selected pixels among unselected ones and remove them. Next, we could segment the entire image into large blocks -- perhaps squares -- and, for each, see whether a large percentage of its pixels are in the selection. The resulting "block map" can be used to pick out the rough shape or shapes of items with the given texture. And so on.
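The block-map step might be sketched like this, given a 0/1 selection mask; the block size and the fraction cutoff are arbitrary choices:

```python
def block_map(selection, block_size, fraction):
    """Divide a 0/1 selection mask into square blocks and mark each
    block whose selected-pixel fraction meets the cutoff."""
    rows, cols = len(selection), len(selection[0])
    result = []
    for by in range(0, rows, block_size):
        out_row = []
        for bx in range(0, cols, block_size):
            block = [selection[y][x]
                     for y in range(by, min(by + block_size, rows))
                     for x in range(bx, min(bx + block_size, cols))]
            out_row.append(1 if sum(block) / len(block) >= fraction else 0)
        result.append(out_row)
    return result

# A 4x4 mask: the left half is mostly selected, the right half mostly not.
mask = [[1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 0, 1],
        [1, 1, 0, 0]]
print(block_map(mask, 2, 0.75))  # [[1, 0], [1, 0]]
```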
The above thought experiment assumes that we have "intelligently" picked out some patch of an image as a candidate for a texture. What would stop us, alternatively, from picking a patch that contains both some of the water and some of the hills on the shore in the right-hand image, for example?
One other issue this dodges is changes in illumination, as from shadows or the like.
Besides the notion of color schemes applying to textures, there does tend to be genuine structure. The grass texture, for example, has edges that favor up and down orientations. The bricks are definitively ordered from top to bottom in a zig-zag pattern. The leopard's spots are definitively spots with semi-regular spacing, if no obvious ordering. This facet seems to require some more sophisticated processing to deal with.
One interesting approach to texture analysis involves taking a large number of samples of pairs of nearby pixels. For each pixel in the source image, we look at the pixels within a fixed radius of it. For each such pair, we note the two brightness values. Let's say that instead of recognizing 256 shades of gray (brightness), we recognize only 8. We then create a matrix (grid) that's 8 columns wide and 8 rows tall, where the columns represent the first pixel's brightness and the rows represent the second's. For each pair we find, we look in the matrix for the cell that represents that pair's combination of brightness levels. Each cell starts out at zero, so each time we find a pair with that combination, we add one to it.
With a little extra math, we can boil the resulting matrix down to a set of simplified characteristics called "energy", "inertia", "correlation", and "entropy". These can further simplify the task of recognizing a texture using a neural network or classifier system.
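This is essentially what the literature calls a gray-level co-occurrence matrix. Here's a sketch with one simplification -- it considers only horizontally adjacent pairs rather than all pairs within a radius -- along with two of the summary statistics:

```python
import math

def cooccurrence_matrix(image, levels=8):
    """Count co-occurrences of quantized brightness (0-255 input) in
    horizontally adjacent pixel pairs.  A fuller version would sample
    pairs at several offsets within a radius."""
    scale = 256 // levels
    matrix = [[0] * levels for _ in range(levels)]
    for row in image:
        for a, b in zip(row, row[1:]):
            matrix[a // scale][b // scale] += 1
    return matrix

def energy_and_entropy(matrix):
    """Two of the classic summary statistics of such a matrix."""
    total = sum(sum(row) for row in matrix)
    probs = [v / total for row in matrix for v in row if v > 0]
    energy = sum(p * p for p in probs)
    entropy = -sum(p * math.log2(p) for p in probs)
    return energy, entropy

# A perfectly uniform patch piles every pair into one matrix cell,
# giving the maximum possible energy and (near) zero entropy.
flat = [[100] * 5 for _ in range(5)]
energy, entropy = energy_and_entropy(cooccurrence_matrix(flat))
print(energy)
```

A noisy or richly varied texture spreads its counts across many cells, driving energy down and entropy up, which is what makes these numbers useful as texture signatures.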
With a little more math still, we can improve the ability to deal with orthogonal (90°) rotations of a given texture. One downside to this concept, however, is that it doesn't directly address finding the edges of textures. Much of the literature seems to focus on cases where the entire image is one homogeneous texture and nothing else. Another limitation is that if one zooms in or out of a given texture, the resulting matrices will probably differ for the same texture at different scales.
There are other variants of this sort of concept that involve different mathematical complexities. They generally seem to suffer the same limitations, though. If anything, they seem more like exercises in fascinating mathematics than in practical vision systems. It seems so much easier to pick out textures in color images than in black and white, yet these techniques focus naively on black and white for mathematical elegance. Despite these sorts of shortcomings, though, their conceptual basis seems to have merit.
As a side note, there is a related but separate field of study into what is called "texture synthesis", which is about using a sample image texture to generate extensions of that texture or new textures altogether based on multiple source images. Following is an illustration of some examples of this concept. Each real image is paired with a new texture programmatically generated based on it.
|Figure: Above are source images and below are new textures synthesized based on them.|
Although synthesis is not the same thing as analysis, there does seem to be a useful symmetry here. The ability to recall a texture from memory is essentially an ability to synthesize it using some set of rules. These rules should be simpler than the original image, in a sense, and be more generic than what one would expect from just tiling an image to create a repeating texture.
Two Dimensional Perception
Taking a step above the primitive features of images discussed above, we can start to talk more about the substantive content in images. We'll focus in this section on two-dimensional features. That is, we'll limit ourselves to images that don't have intrinsic depth, as though we were considering a bulletin board with flat things pinned to it.
Following are some examples of images that we can process in a two-dimensional context.
|Figure: Some images that are good candidates for 2D perceptual processing.|
Not surprisingly, there are many ways to approach analyzing such images. Since there's still no such thing as a general purpose machine vision system, deciding which approach to use is often a matter of what one is trying to accomplish.
Pixel Pattern Matching
|Figure: Images of bathroom tiles in our illustration.|
As stated above, the goal of a machine vision system often determines the method chosen. Let's say our goal is to identify whole images by matching them against a set of known images.
Let's say we have a set of images of bathroom tiles that we manufacture. In our application, we will be fed images of whole, single tiles. The images are always of the same width and height. We also have a finite set of images of the tiles we manufacture. As we're fed new images to identify, then, we want to identify which known tile the new image is most like.
Since we have whole images, we decide to do best-fit matching of the whole images. Looking at the figure at right, it seems the main distinguishing feature among the four sample tiles is their overall color. That suggests one simple approach might be to find the average color of each tile. Our database of known tile models would simply have the same average color calculated on one or perhaps several samples of each tile model. So as a new tile image comes past our analyzer, it takes the image's average color and finds the model in the database whose average color is the shortest "distance" away. To avoid problems that might arise from the white space surrounding each tile, we might ignore the outer 10% margins of each image when finding the average color.
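A sketch of the averaging and nearest-match steps, simplified to grayscale rather than a full color distance; the tile model names and stored averages are invented:

```python
def average_gray(image, margin_fraction=0.1):
    """Average brightness of an image, ignoring an outer margin."""
    rows, cols = len(image), len(image[0])
    my, mx = int(rows * margin_fraction), int(cols * margin_fraction)
    values = [image[y][x]
              for y in range(my, rows - my)
              for x in range(mx, cols - mx)]
    return sum(values) / len(values)

def closest_model(sample_average, models):
    """Pick the model name whose stored average is nearest the sample's."""
    return min(models, key=lambda name: abs(models[name] - sample_average))

# Invented model database: name -> stored average brightness.
models = {"slate": 60.0, "sand": 180.0, "snow": 240.0}

tile_image = [[180, 179, 181],
              [178, 182, 180],
              [181, 179, 180]]  # a tiny stand-in for a sand-colored tile
print(closest_model(average_gray(tile_image), models))  # 'sand'
```

With real RGB pixels, the "distance" would become a Euclidean distance in color space, but the structure of the algorithm stays the same.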
Let's say that we found that there are tiles that had the same average color but which are different in shape. Some are square, some hexagonal, and some triangular. The above color-based algorithm would probably not be sufficient. We might modify our algorithm to include shapes. To deal with shape, we'll opt to create simple masks for each known shape. Each mask is just a two-color image, as illustrated by the following figure.
|Figure: Sample masks for recognizing squares, triangles, and hexagons.|
The shapes don't have to be perfectly clean or straight. Each tile model, then, is associated with one of the known shapes. So when we see a new image, we compare the shape of the tile within it against the known shapes. To do this, we might first use a flood-fill starting from one corner of the image to select the white margin around the tile. From this selection we create a new image that has the same two colors as our shape masks. Next, we compare the two images, pixel by pixel. For each pixel that doesn't match, we add one to a count of mismatched pixels. In the end, whichever mask has the lowest mismatch count is the one we choose as best representing the shape of the tile in our test image.
To make our algorithm a little better, we also use the mask we created using flood fill to find the average color by only looking within the area that is not in the outer-margin selection. Armed with the known shape and average color of the tile, we again find the best match for these two properties in our database and thus identify our image.
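The mask-comparison step can be sketched in a few lines of Python. I'm assuming here that the masks are 2D lists of 0 (margin) and 1 (tile), all the same size, and the shape names are hypothetical.

```python
# Sketch: compare a tile's flood-fill mask against known shape masks
# and pick the shape with the fewest mismatched pixels.

def mismatch_count(test_mask, known_mask):
    """Count pixels where the two binary masks disagree."""
    return sum(
        1
        for row_t, row_k in zip(test_mask, known_mask)
        for pt, pk in zip(row_t, row_k)
        if pt != pk
    )

def best_shape(test_mask, shape_masks):
    """Pick the known shape with the lowest mismatch count."""
    return min(shape_masks,
               key=lambda name: mismatch_count(test_mask, shape_masks[name]))
```

Note that because we only count disagreements, the masks don't need to be perfectly clean or straight, exactly as described above.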
This thought experiment illustrates how straightforward some vision applications can be when they are defined carefully to reduce their potential complexity. What if we increased the complexity of our present problem? Let's say one series of tiles is white and square, but each has a different large letter (e.g., "A", "B", "C") on it. The above algorithm is no longer sufficient.
To solve this problem, we decide to first identify the model or model series of tile and, if a tile is identified to be in the "letter" series, we'll use a new algorithm to identify which letter it is. We could use the masking approach described above, but let's be creative and say we want to use a neural network. We buy an off-the-shelf neural network software package and train it to recognize each of the letters that we might find on the tiles in the letter series. Training done, we switch the neural net into its regular behavior mode and go from there. With each tile put before it, the neural net will output which model (letter) it thinks the tile represents.
In each case in the above example, we've considered what might loosely be called pixel patterns. We considered the average color of a textured object, the overall shape in terms of a mask, and the shape of some bitmapped feature (letters) within such a shape. We never resorted to trying to find lines or corners or other more abstract features. We didn't even need to deal with images being at different scales or rotations, let alone in varying lighting conditions.
One practical technique available for use in 2D perception applications is the isolation of objects of interest into "blobs" that can be counted, characterized, or have their positional relationships considered.
|Figure: Insects "thresholded" to isolate|
them from a fairly plain background.
The figure at right illustrates a typical example. The technique used to isolate the insects in the image from the background is trivial. The brightness of each pixel is measured and, if it is above a certain threshold value, it is painted white and otherwise black. Each insect, in the thresholded image, can be thought of as a "blob" in the image. We'll call them "blob objects", or "blobjects".
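The thresholding step really is trivial. Here's a one-function Python sketch, assuming the image is a 2D list of grayscale values in 0..255 and that the insects are darker than the background (the cutoff value is arbitrary).

```python
# Sketch: brightness thresholding. Pixels darker than the cutoff become
# 1 (blob); everything else becomes 0 (background).

def threshold(image, cutoff=128):
    """Return a binary image separating dark blobs from a light background."""
    return [[1 if pixel < cutoff else 0 for pixel in row] for row in image]
```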
It's easy for us to perceive the individual blobjects, but can be quite a challenge to get a piece of software to do as well. The simplest approach would be to consider every black pixel in the image and, for each, perform a flood-fill. The flood-filling would continue until one of the following conditions is met:
- The width of a bounding box gets larger than some constant W.
- The height of a bounding box gets larger than some constant H.
- The area (number of pixels) of the region gets larger than some constant Amax.
- The region gets fully filled and the area of the region is larger than some constant Amin.
Only in this last condition would we conclude that we've found an insect blob. To speed up execution a bit, we would keep track of all the pixels we've already tested so that, as we continue the scan, we don't consider the same blobject twice.
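Here's a simplified Python sketch of that scan, using the four conditions above. The grid is a thresholded 2D list of 0/1 values; the `W`, `H`, `Amin`, and `Amax` limits are arbitrary illustration values.

```python
# Sketch: find "blobjects" by flood-filling from each unvisited blob
# pixel, then keeping only regions that pass the size tests.

def find_blobs(grid, W=50, H=50, Amin=4, Amax=2000):
    """Return (miny, minx, maxy, maxx) boxes for regions passing the tests."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    blobs = []
    for sy in range(h):
        for sx in range(w):
            if grid[sy][sx] != 1 or seen[sy][sx]:
                continue
            # Iterative flood fill from this seed pixel.
            stack = [(sy, sx)]
            seen[sy][sx] = True
            area = 0
            minx = maxx = sx
            miny = maxy = sy
            while stack:
                y, x = stack.pop()
                area += 1
                minx, maxx = min(minx, x), max(maxx, x)
                miny, maxy = min(miny, y), max(maxy, y)
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and grid[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            # Keep the region only if it passes all the size tests.
            if (maxx - minx + 1 <= W and maxy - miny + 1 <= H
                    and Amin <= area <= Amax):
                blobs.append((miny, minx, maxy, maxx))
    return blobs
```

The `seen` array is what keeps us from considering the same blobject twice as the scan continues.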
It should be apparent from this example, however, that blob detection is not going to be a clean process using the above algorithm. Wherever insects touch or overlap one another, it's likely we will meet one of the above failure conditions. The bounding box might get too big or the total area filled by a region might be exceeded. It becomes necessary to introduce more sophisticated techniques to get more accurate information.
Admittedly, this technique, though it can be practical in controlled circumstances and quite useful for certain classes of tasks, is actually very limited. The need to manually set a usable threshold value for separating blob from background means it's usually necessary to ensure that the background against which blobjects are placed is in high contrast to the blobjects. And what makes this fundamentally a two dimensional perception problem is the fact that the blobjects really need to be guaranteed to be generally non-touching and non-overlapping. This is usually much harder to come by in a three dimensional environment.
One somewhat simplified version of blob detection that has found practical application is navigation based on known, fixed points that can be perceived. "Astral navigation" is common fare in popular science fiction as the way a space vessel gets its bearings by observing the positions of the stars around it. And now we actually do have vessels able to do this, including NASA's Deep Space 1.
The concept of orienting based on the positions of points is fairly straightforward. First, one takes an image of the stars in the current field of view. In deep space, most of the possible field of view is black or very nearly so. Most visible objects appear as small dots perhaps one or a few pixels in size. It's easy to isolate these blobs from the black of space. Their positions in the image are recorded as a list of points. The goal is to be able to identify which known star each of the given points represents. Once one knows with certainty which stars any two of the points in the image are, it's then easy to figure out which way the spacecraft is facing.
|Figure: Using the relative distances between stars as a|
way of identifying stars for use in self orientation.
Although there are plenty of ways to use point position information in the source image to figure out which stars one is seeing, let me describe one very simplistic way to illustrate how easy it can be. First, assume that our camera cannot change its zoom level. We know that our spacecraft will stay within our own solar system, which means that no matter where we are within the solar system, the positions of luminous objects (stars, galaxies, etc.) outside our solar system will not appear significantly different than if we were somewhere else in the solar system. So in any picture our spacecraft takes of any two known luminous bodies outside this system, the distance measured between them will be the same. Our solution, then, begins with a database containing two basic kinds of information. The first is a list of known luminous bodies. The second is a list of distances between any two known luminous bodies. So we take a picture of the sky and separate all the bright blobs into their own point positions. For each pair of points, we measure the distance and go to our database of body-to-body distances to find candidates. As we do this, we'll have different possible alternatives, but the more distances we measure and correlations we make, the more we will be able to narrow the interpretations down to exactly one and to increase our certainty of the interpretation.
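The distance-lookup step might look something like the following Python sketch. The star names and catalog distances are entirely made up, and a real system would correlate many pairs, not just one, to narrow down the interpretation.

```python
# Toy sketch: given two bright points in an image, look up which pairs
# of cataloged stars have a matching angular separation (in pixels,
# since we assumed a fixed zoom level).

import math

CATALOG = {
    ("Alpha", "Beta"): 50.0,    # hypothetical pixel distances
    ("Alpha", "Gamma"): 120.0,
    ("Beta", "Gamma"): 90.0,
}

def candidate_pairs(p1, p2, catalog=CATALOG, tolerance=2.0):
    """Return star pairs whose catalog distance matches the measured one."""
    d = math.dist(p1, p2)
    return [pair for pair, dist in catalog.items()
            if abs(dist - d) <= tolerance]
```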
It's important to note that this technique works great in the context of astral navigation because we can count on the field of view to vary minimally within the distances that we care to work with. This is what makes it a two dimensional perceptual problem. If we were talking about a spacecraft that traversed many light years' distance, the relative positions of the stars themselves would shift enough, through parallax, that we would have to change our approach, because it would now be a three dimensional perceptual problem.
2D Feature Networks
Given a complex two dimensional scene and a goal of being able to identify all the objects within it, one general approach is to identify a variety of easily isolated primitive features and to attempt to match the combinations of such features against a database of known objects. There are many ways to go about this, and no way seems to fit all needs. Still, we'll consider a few here.
The astral navigation technique described earlier can be a good starting point for identifying objects in a 2D scene. The first thing to do is identify important points. This can be done by identifying exceptionally bright or dark points, blobs of a significant color, and so forth and calculating distances among the points to see if there are known configurations in the scene. Another primitive feature that some researchers have had success in isolating is sharp corners and junctions where three or more lines meet. In an image of a "pac man", for example, there are three sharp corners that form the pie wedge of a mouth in the circular body. A picture of a stick man would have lots of corners and junctions. Once the raw image is processed to find such corners and junctions, they too become points whose relative positions can be measured and compared to known proportions. When a significant percentage of the components that define some object are matched, we can isolate that portion of the image as a single instance of that kind of object.
Another interesting technique that has been tried is to study the outline of a shape. A shape can be isolated using edge detection or flood filling, for example. A secondary image that only includes the outline of a single shape can be isolated. It's not hard, then, to find the smallest possible circle that can fit around that shape and identify its center point. Then the code traverses from one point to the next in the outline. For each point, the distance from the center and the angle are measured. Because the object can be rotated at any angle, one goal is to "rotate" the image until it fits a "standard" orientation. One way to do so would be to find the three or more points that are farthest from the center; i.e., those that touch the outer circumference. Using one of a variety of techniques, one of these points can be identified as the "first" one. The whole image would effectively be rotated around the center point so that that first point is straight above the center point. The image wouldn't literally be rotated, of course. What would actually happen is that an angular offset would be added to each angle-plus-distance measured point so that the "first" point would actually be the first one in the list of such points. Then the distances-from-center would be normalized so that the farthest-out ones would be exactly one distance unit from the center.
The result of all this processing, then, can be graphed as a linear profile, with the X axis going from zero to the full 360 degrees and the Y axis measuring from zero to one the normalized height. This graph can be further analyzed to find known patterns. One simple way to do this is to reduce the resolution of the graph so that the X and Y values range from, say, zero to sixteen and to create a 16 x 16 matrix with true and false values. There would be a true value at any point in the matrix where at least one point is found at that combination of X and Y values. That matrix can then be compared against a database full of such matrices for known shapes.
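Here's a Python sketch of building that matrix, assuming the outline has already been extracted as a list of (x, y) points around a known center. I've simplified the "first point" choice to just the farthest-out point, so this is an illustration of the idea rather than the full technique.

```python
# Sketch: turn an outline into a rotation-normalized bins x bins
# True/False signature matrix (angle along one axis, normalized
# distance-from-center along the other).

import math

def shape_signature(outline, center, bins=16):
    """Build a bins x bins True/False matrix from the outline profile."""
    cx, cy = center
    polar = []
    for x, y in outline:
        dist = math.hypot(x - cx, y - cy)
        angle = math.atan2(y - cy, x - cx) % (2 * math.pi)
        polar.append((angle, dist))
    rmax = max(d for _, d in polar)
    # "Rotate" by an angular offset so the farthest point becomes angle zero.
    a0 = max(polar, key=lambda p: p[1])[0]
    matrix = [[False] * bins for _ in range(bins)]
    for angle, dist in polar:
        col = int(((angle - a0) % (2 * math.pi)) / (2 * math.pi) * bins) % bins
        row = min(int(dist / rmax * bins), bins - 1)  # normalize to 0..1
        matrix[row][col] = True
    return matrix
```

The resulting matrix could then be compared cell by cell against a database of such matrices for known shapes, much like the tile masks earlier.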
Three Dimensional Perception
In contrast to two dimensional perception, three dimensional perception is about processing information in all three spatial dimensions, not just in flat or virtually flat worlds; usually, it means detecting where some or all objects within a visual field are in space. Although there are many interesting experiments and products that deal in 3D perception, this area is much less well developed. Let me introduce some broad areas of interest.
No discussion of 3D perception could be complete without consideration of the most obvious technique of perceiving objects in space: directly. Human eyes are presented with flat images that we have to work with to guess at how far away things are in space. Certain kinds of devices, though, actually "see" how far away things are.
|Figure: Emitted and reflected|
It all started with radar, a British invention dating back to the years between the First and Second World Wars. Scientists found that radio waves would reflect off of some kinds of objects and sometimes back toward their original sources. Since we know how fast a radio wave travels -- the same speed as light waves -- it became possible to measure how far away the reflector was based on how long it takes for a radio wave to be received after it was transmitted. It is true of an ordinary transmitter, like a radio broadcasting aerial tower, that its signals get reflected back toward it. And you could probably measure the time differences, but you'd have two problems. First, you'd have to send out very short pulses instead of a continuous broadcast. You need a "beginning" for your signal so you can mark when its reflection returns and calculate the time difference. Second, you wouldn't know where in space the reflector -- a car, for example -- is. To figure that out, you need to focus the radio beam so it mainly travels in a single direction. Then you would sweep your transmitter / receiver combination back and forth or around in a full circle so you cover a wide field of view. While radio waves were where radar technology began, most radar systems today don't use long radio waves but the much shorter microwaves, which penetrate weather and certain materials better and can also be used to create finer images.
What is true of radio and microwave waves in this regard is also true of light waves. You could, technically, have a friend stand miles away with a mirror and shine a flashlight in his direction and measure how long it takes before you see the reflection in order to calculate how far away he is, but the time delay would be so small that you probably wouldn't notice it. It's been estimated that an object traveling as fast as light could circle Earth about seven times in a single second. Still, we have long had electronics that operate fast enough to detect such small time delays.
The gold standard today in direct perception of 3D spaces is to use lasers. Using the same concepts described above for radar, a scanner makes a laser beam scan left to right, top to bottom in the same sort of way you typically read a page of text in a book. In each direction the scanner is aimed, a laser pulse is fired and a light detector determines how long it takes for a reflection to be measured. Since we have a direction and a distance, we can plot a point in 3D space where the reflection occurred and hence where some part of a physical object is. And since a laser beam can be made to stay very sharp over large distances, it's possible to get very precise 3D coordinates using a laser scanner. Following is an example of a machine for surveying using a laser scanner.
|Figure: Laser scanner used in high definition surveying.|
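Turning a single pulse into a 3D point is just the radar timing idea plus a little trigonometry. Here's a small Python sketch; the angle conventions (azimuth around the vertical axis, elevation up from horizontal) are my own assumption, not any particular scanner's.

```python
# Sketch: one laser-scanner measurement -> one 3D point. The round-trip
# time gives distance, and the scan angles give direction.

import math

C = 299_792_458.0  # speed of light, meters per second

def scan_point(azimuth, elevation, round_trip_seconds):
    """Return (x, y, z) in meters for one pulse; angles in radians."""
    distance = C * round_trip_seconds / 2.0  # the pulse goes out AND back
    x = distance * math.cos(elevation) * math.cos(azimuth)
    y = distance * math.cos(elevation) * math.sin(azimuth)
    z = distance * math.sin(elevation)
    return (x, y, z)
```

A reflection arriving 2 microseconds after the pulse fired, for example, works out to a point roughly 300 meters away along the beam direction.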
3D points do not a 3D picture make, though. Usually, the next step is to connect the points together into 3D surfaces. One could simply do this by assuming every reflection point measured is connected to the ones to the left and right and above and below. The problem with this is that one ends up seeing the entire world as one single, solid object. We know, of course, that the 3D world is composed of many separate objects and we know that some things are in front of others that we can't see.
If I were trying to endow a robot with the ability to see using laser scanning, I would probably want it to understand this idea of one object occluding another and the idea that there may be space between them that can't be seen yet. One very simple way to do this is to modify the above mesh-building algorithm slightly. For each pair of neighboring points, I would calculate how far apart they are in depth. Above a certain difference, I would declare the two points part of separate surfaces; below it, I would assume they are part of the same surface. The following figure illustrates this:
|Figure: One way to determine when surfaces are connected or disconnected.|
How would I set the threshold? The broken record comes around again here to sing the refrain that there is no universal answer. Perhaps our goal would be to make it so our robot can move around in space and so we might arbitrarily choose a threshold of, say, 3 feet if our robot can move within a 3 foot wide space. We could also use the "tears" in the 3D mesh to segment distinct objects out using familiar techniques like our basic flood-fill algorithm.
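Applied to a single scan row, the depth-gap rule is only a few lines of Python. This sketch assumes depths in feet and uses the arbitrary 3-foot threshold mentioned above; a real version would apply the same test between vertical neighbors too.

```python
# Sketch: split one row of depth samples into separate surfaces
# wherever neighboring samples differ by more than the threshold.

def split_surfaces(depths, threshold=3.0):
    """Group consecutive depth samples into surfaces, splitting at gaps."""
    surfaces = [[depths[0]]]
    for prev, cur in zip(depths, depths[1:]):
        if abs(cur - prev) > threshold:
            surfaces.append([cur])    # gap too big: a "tear" starts a new surface
        else:
            surfaces[-1].append(cur)  # same surface continues
    return surfaces
```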
Before you decide that in laser scanners we finally have found the ultimate solution to the 3D perception problem, let me throw water on that fire. If the goal is to get a 3D image of the world, laser scanning is an excellent solution. If the goal is to get a machine to understand the world, laser scanners do nothing more than measure distances to points. They don't "understand" the world any better than digital cameras do. And they tend to not see light levels or colors like a camera does; only points in space. As you'll see later, though, laser scanning can be used in conjunction with other techniques as "cheats" to work around solving certain problems that are easily solved by our own visual systems.
|Figure: Two cameras in a stereo (binocular) arrangement.|
We have two eyes. And while it's true they give us a sharper image than a single eye would, the most interesting benefit of having two eyes is that we can use them to help us judge distances. We do so using "binocular" vision techniques.
To understand what this means, try a simple experiment. Look at a corner or other vertical edge on a distant wall. Close your left eye. Stick your finger up at arm's length so the tip is just to the left of that vertical edge. Now open your left eye and close your right. You should see that your finger is now to the right of the edge. Try alternating between having just your left and right eyes open and you'll see that your finger appears to move between being to the left and to the right of that vertical edge. The reason for this is fairly obvious: your eyes are in two different places in the world and so see different views of it. Your brain makes use of these differences to tell you useful things, like the fact that your finger is closer to you than that vertical edge is.
In theory, binocular vision makes perfect sense and is pretty easy to imagine. In practice, though, making software to line up objects seen by two cameras in a binocular arrangement is not so easy to do. One way that's been explored by some researchers is to find "interesting" points in a stereo pair of images and to measure how different those points are from one another in the horizontal direction. Provided one can tell when two points of interest represent the same point in 3D space, one can build up what Hans Moravec calls an "evidence grid". The horizontal offset of each point pair provides "evidence" of a real point in space where that point exists. The following illustrates this idea:
|Figure: Representation of 3D points discovered as "evidence of occupancy" in three dimensions of certain points in a stereo pair of images.|
While it just looks like a cartoonish version of the original image on the left, the one on the right is actually just a 2D projection of a 3D evidence grid built up from a stereo pair of images like the original one seen here. One could take that 3D image and rotate it to project a view from any place within or "outside" the room imaged. And with a bit more processing, one can make some guesses about how points in the evidence grid are related to form meshes of the same sort described earlier for use in 3D laser scanning.
|Figure: Depth discontinuity segmentation.|
Another technique involves trying to match pieces of the left image with pieces of the right image using literal bitmap matching. Because we shouldn't have vertical variation, only horizontal variation, we start by breaking up the original bitmaps into separate horizontal slices from top to bottom in each image and comparing corresponding slices. Each slice is a 1D bitmap, which is much easier to process. The next part is to break this 1D image up into segments using edge detection. Each span of pixels between two bounding edges is considered as a single, solid "object". The average color of each object can easily be calculated and that color can be searched for in the opposite image's corresponding slice. Using some brute computation, we should be able to come up pretty readily with a pretty good interpretation of which objects in the left image slice line up with the ones in the right image slice. There will often be miscellaneous bits that don't match, of course, which make the computation a little more complicated. But we can even improve on the quality of our guesses by comparing the results of one horizontal slice pair against the ones directly above and below it to see if there are correlations. The end result, though, is yet another set of points in space that are defined by the edges found earlier and including bitmaps that nicely fill the spaces between those points.
|Figure: Calculating distance using stereo disparity.|
One somewhat low-complexity technique for determining distance to objects in a scene involves taking a relatively small portion of what the left eye sees -- roughly around the center of its field of view -- and finding out where the best match for it is in the right eye. This requires moving a frame of the same size as the one for the left eye from left to right in the right eye's field of view. At each point, the differences of each left/right pixel pair are summed up. Once this survey is done, the place that had the lowest sum of differences is considered the best match. The horizontal pixel offset of that frame's position in the right versus the left camera's corresponding frame is then used to calculate how far away the subject matter is. This works fairly well when what the frames contain is pretty homogeneous, in terms of distance, or when parts of the background -- perhaps a wall behind a person -- that do creep into the frame are relatively flat in texture. This technique is analogous to how your own eyes work, but it only gives distance for whatever is in the frame, rather than building a complete 3D scene.
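This sum-of-differences search is easy to sketch in Python. I'm assuming grayscale images as 2D lists, and the focal length and camera baseline in the usage note are made-up numbers; the classic pinhole relation depth = focal × baseline ÷ disparity converts the winning offset into distance.

```python
# Sketch: slide a small frame from the left image across the right
# image's same rows, score each offset by summed absolute pixel
# differences, and convert the winning offset (disparity) to depth.

def best_offset(left, right, x0, y0, size, max_shift):
    """Return the horizontal offset with the lowest sum of differences."""
    best_shift, best_sum = 0, float("inf")
    for shift in range(min(max_shift, x0) + 1):
        total = 0
        for dy in range(size):
            for dx in range(size):
                total += abs(left[y0 + dy][x0 + dx]
                             - right[y0 + dy][x0 + dx - shift])
        if total < best_sum:
            best_shift, best_sum = shift, total
    return best_shift

def depth_from_disparity(shift, focal_px, baseline_m):
    """Pinhole-camera relation: depth = focal * baseline / disparity."""
    return float("inf") if shift == 0 else focal_px * baseline_m / shift
```

For example, with a hypothetical 500-pixel focal length and cameras 0.1 meters apart, a 2-pixel disparity would put the subject matter 25 meters away.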
I hope this brief introduction to machine vision has been helpful to you. As I stated in the beginning, it is by no means complete, but it's not a bad intro if you are just getting started or are just curious. I also hope it has successfully given you the sense that a lot of the stuff being done today is not as complicated -- or competent -- as it is often portrayed in the popular media and technical literature. There's a lot one can do with just some simple tricks.
Moreover, we're clearly nowhere near achieving the ultimate goal of general purpose vision in machines. There's plenty of room for aspiring AI researchers to get in the game, even today. The road ahead is long and the prospects are great.
Incidentally, I have made a point of not making reference to my own machine vision research projects because I didn't want this to be primarily about my work. I invite you, however, to check out my machine vision site for more about what I'm working on.
Following are some other sites I found of interest in the subject of machine vision.