The Input and Output Bandwidth of the Eye and Body

L. Van Warren
Warren Design Vision
Jan 11, 1998
Updated Mar 30, 2001

Introduction:
This short note does some calculations regarding the bandwidth of the eye, viewed as an input device.
This is followed by a consideration of the body as a gestural output device.
Information acquisition, information delivery.
Less formally:

  • How Many Pixels Does an Eyeball Have?
  • How many vision receptors are there?
  • What is the spatial and temporal resolution of the eye?
  • At what rate can the body express information to the outside world?

First Guess: The Eye has 90 Million "Pixels".
The figure above implies a reasonable value for optical receptor density is 100,000 receptors per square millimeter.  Retinal coverage is nearly 180 degrees.  To compute the receptor count we need the diameter of the eye.  The radius of the average eye is 12 mm.  This implies that the surface area is 2πr², or 904.8 square millimeters.  This first-order calculation implies that the average eye contains 90 million receptors. Most people have two. Eyes, that is.  When building display devices, it is convenient to give pixels Cartesian (i, j) or (x, y) coordinates.  Based on this, if we were to build a display device that could simultaneously excite all the optical receptors, and if this display device were a square, there would be 9512 pixels on an edge.
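A quick check of this arithmetic, as a Python sketch (the 100,000 receptors per square millimeter density and the 12 mm radius are the figures assumed above):

    import math

    density = 100_000                    # receptors per square millimeter
    radius = 12.0                        # eye radius in millimeters

    area = 2 * math.pi * radius ** 2     # hemispherical retina: ~904.8 mm^2
    receptors = density * area           # ~90.5 million receptors
    edge = math.sqrt(receptors)          # ~9512 pixels on an edge

    print(f"{area:.1f} mm^2, {receptors/1e6:.1f} M receptors, {edge:.0f} px edge")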

A Better Approximation: The Eye has 126 Million "Pixels"
According to Dr. John Penn, of the UAMS eye center, the adult retina has 126 million receptors.  He points out that not all of these are activated under all lighting conditions, to wit, "as light environment increases in luminance, rod response becomes saturated long before cones are maximally functional."

A Washington neuroscience source agrees with Dr. Penn: according to it, there are 120 million rods and 6 million cones.
Using the figure of 126 million "pixels" or receptors, a display device that met or exceeded the performance of a fixed, staring eye would have 11,225 pixels on an edge.

Only Cones See Color
An important detail in this pursuit is to notice that only the cone cells enable the perception of color. There are three kinds of cones, named "red", "green" and "blue" and designated the L, M and S cones respectively.  Rods have a peak sensitivity in the green region of the spectrum, at about 500 nm.  Perhaps that is why night vision goggles use green as their luminance display color.  The cones, though concentrated in the central (foveal) region of the eye, are also distributed throughout the retina, but in low concentration relative to rods.  Rods take more time to acquire signal than do cones, up to 1/10 of a second.  Thus it is the peripheral cones that contribute to motion sensing, not the rods!  If you don't believe this, try playing tennis at twilight. Your night vision won't do the motion processing job.  In exchange for their slowness, the rods contribute extraordinary luminance sensitivity, down to a single photon.  There are 6 million cones.  Let us assume an equal distribution of L, M and S cones, and define a {red, green, blue} triple of cones as equivalent to a single pixel.  This yields two million "color" receptors that our display device must service, for a fixed and staring eye.  This corresponds to a square with 1414 pixels on an edge.
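The same square-root arithmetic with Dr. Penn's counts, sketched in Python (the even L/M/S split is the assumption stated above):

    import math

    receptors = 126_000_000                   # rods + cones, adult retina
    cones = 6_000_000                         # color-capable receptors
    color_pixels = cones // 3                 # one {L, M, S} triple per "pixel"

    print(round(math.sqrt(receptors)))        # 11225 pixels on an edge, all receptors
    print(round(math.sqrt(color_pixels)))     # 1414 pixels on an edge, color only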

There Is No Fixed and Staring Eye
The only fixed and staring eye is on a dead person!  The living eye is constantly in motion; it must be, or the image will wash out.  Books, television, movies and computer displays all require the eye to scan to new content rapidly, both spatially and temporally.  These media are all fixed with respect to the coordinate system of an unmoving head.  We can now refine our display device calculation based on this.  If there were such a thing as a fixed and staring eye, then we could simply build a 1414 x 1414 color display and that would keep the fovea busy.  To occupy the rods, this display would have to sit in the center of a much larger 11225 x 11225 pixel display. This outer display would be gray scale only - or "green scale" as the case might be.  The inner display would have two percent of the area and carry five percent of the data of the outer.
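Checking the area and data ratios of this two-display arrangement (a sketch; it treats the outer display as one gray channel per pixel and the inner as three color channels):

    inner_edge, outer_edge = 1414, 11225

    area_ratio = inner_edge ** 2 / outer_edge ** 2        # ~1.6%, about two percent
    data_ratio = 3 * inner_edge ** 2 / outer_edge ** 2    # ~4.8%, about five percent

    print(f"{area_ratio:.1%} of the area, {data_ratio:.1%} of the data")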

But this is only for a fixed staring eye.  If the eye moves to an adjacent point on the screen, it would be necessary for that part of the screen to become color and the rest to revert to the "gray".  Reliably tracking random eye movements and displaying the corresponding color/gray image is a formidable task.  Even if this could be done, it would not account for head movement.  Another alternative would be to put the display on or near the eye itself.

Spatial Resolution of Wearable and Static Displays
For a wearable computer the above arguments tell us that a 1414 x 1414 resolution foveal display at the six o'clock position of standard eyeglasses should be sufficient.  One could "draw the curtains" and go into work mode by supplementing it with the high bandwidth peripheral display.  For safety reasons while walking on the sidewalk, only the foveal display would remain activated.  A tabletop, wall, television, movie or entertainment display that allows for eye movement, head movement and peripheral vision should accommodate the full resolution of the eye, in color!  We also want to acknowledge the architectural simplicity that results from choosing an edge dimension for the image that is a power of two.  For static displays the nearest power of two that meets or exceeds the 11,225 value is 16384 pixels on an edge.  If we choose 2^14, or 16384, pixels on an edge as the figure that meets visual system performance, then each image contains 268 million pixels.  Now we can move our eyes and turn our heads just like real people.
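A small helper that picks such an edge dimension, sketched in Python:

    def next_pow2(n: int) -> int:
        """Smallest power of two that meets or exceeds n."""
        p = 1
        while p < n:
            p *= 2
        return p

    edge = next_pow2(11_225)       # 16384 = 2**14
    print(edge, edge * edge)       # 16384 pixels on an edge, 268,435,456 per image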

Color Resolution
It is known that eight bits of intensity is unsatisfactory for color reproduction, particularly in the low blues, and that at least 12 bits are preferable.
On computer architectures that align by the byte, a better size choice is 16 bits per color or alpha channel.  We have now defined spatial and color resolutions for static frames that meet computer architecture requirements and meet or exceed the visual system performance.  We have ignored the issue that most display devices cannot reproduce the dynamic range of intensities found in nature; however, 16 bits of intensity bring us closer.
If each color pixel consists of four sixteen-bit words, each uncompressed image requires 2.15 gigabytes of storage.
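A quick check of the per-frame arithmetic, in Python:

    pixels = 16_384 ** 2            # 268,435,456 pixels per frame
    bytes_per_pixel = 4 * 2         # four channels x sixteen bits (two bytes) each

    print(pixels * bytes_per_pixel / 1e9)   # ~2.15 gigabytes per uncompressed frame

Armed with this design information, let us now define an idealized motion picture and transmission capability that would also be designed to the limits of the human visual system.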

Temporal Resolution
It is known that many people are able to discern flashing at rates exceeding 60 Hz.  Fluorescent lights are a good example of this.  I can see the corners of this monitor flashing at 72 Hz.  It appears that cone vision is responsible for this temporal sensitivity.  If we were to make a choice that met or exceeded the performance of the human visual system, and was also a power of two, the nearest choice would be 128 Hz.  There is reason to believe that trained fighter pilots, baseball players and table tennis players make motion decisions based on these kinds of rates, so this is entirely appropriate.

Content Channel Resolution
Having established that the user has two eyes, and needs to see imagery at 128 Hz, with 268 million pixels per image, we now need to define the number of channels of content that might be made available to the user.  Conventional cable systems have from 40 to over 100 channels.  Let us assume that 128 channels is sufficient to meet the needs of most content consumers at a level of quality that meets or exceeds the capabilities of the visual system.

We are not including the audio or haptic requirements of a sensory input limited system, but a similar calculation could be performed using the touch receptor density and the audio frequency range.  Haptic and audio bandwidth requirements are not nearly as severe as visual system demands.

A Calculation

Two Eyes x Channels x Pixels/Frame x Bits/Pixel x Frames/Sec
   2     x   128    x 268,435,456  x     64     x    128     =  6 x 10^14 bits/sec
  2^1    x   2^7    x     2^28     x    2^6     x    2^7     =  2^49 bits/sec
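The same product, computed directly (every factor is a power of two, so the exponents simply add up to 49):

    rate = 2 * 128 * 2 ** 28 * 64 * 128      # eyes x channels x pixels x bits x Hz
    print(rate)                              # 562,949,953,421,312 bits/sec, ~6 x 10^14
    print(rate == 2 ** 49)                   # True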
 

So a system that met the performance requirements of the human visual system and satisfied the variety requirement of an intelligent user would require delivery devices and networks whose bandwidth is at least 6 x 10^14 bits per second, i.e., 600 terabits per second.
Assuming one bit per carrier cycle, such systems would have to operate at a carrier frequency near 6 x 10^14 Hz, a wavelength shorter than about 500 nanometers, which is in the range of green light.  To store a two hour movie of a single channel at this sampling rate would require

2 [eyes] x 1 [channel] x 268,435,456 [pixels/frame] x 64 [bits/pixel] x 128 [frames/sec] x 3600 x 2 [sec] / 8 [bits/byte] ≈ 4 petabytes (4000 terabytes, each of 1000 gigabytes)

This is a little more than a video tape currently holds!
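Both figures check out in a few lines of Python (the wavelength step assumes one bit per carrier cycle, as above):

    one_channel = 2 * 1 * 2 ** 28 * 64 * 128    # bits/sec: both eyes, one channel
    movie_bytes = one_channel * 2 * 3600 / 8    # two hours, eight bits per byte
    print(movie_bytes / 1e15)                   # ~3.96 petabytes

    wavelength = 3e8 / 2 ** 49                  # carrier at ~6 x 10^14 Hz
    print(wavelength * 1e9)                     # ~533 nm, green light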

Simplified Calculation for Monocular Displays
 
Traditional monocular displays include CRT monitors, liquid crystal displays, and projection screens that do not exploit polarized light. If we revise our estimates for current display technology and ask the question, "How much data can we deliver to the eye from a traditional display?", we have:

1 [Eye] x 1 [Channel] x 1024 x 768 [Pixels/Frame] x 4 [Bytes/Pixel] x 24 [Frame/Sec]

= 75 [Megabytes/Sec]
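The same calculation as a Python one-liner (assumptions as above: 1024 x 768 pixels, four bytes per pixel, 24 frames per second):

    print(1 * 1 * 1024 * 768 * 4 * 24 / 1e6)   # ~75.5 megabytes per second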

Gestural Bandwidth vs. Visual Bandwidth

We might compare this input bandwidth of the human being to the output bandwidth. Output bandwidth, or the rate at which we can express ourselves, includes all vocalization and movement that could be digitally captured. This is a somewhat more difficult calculation, but we can create an upper bound based on the number of joint degrees of freedom, the number of distinct positions, and the rate at which those positions could change.

Fingers have 3 degrees of freedom each; 5 fingers give 15 degrees of freedom (DOF). The wrist adds 3 DOF, the elbow 2 more, and the shoulder 3 more. So an arm has 23 DOF.

The torso, like the shoulder, has 3 gross DOF; dancers can add about another 6 fine DOF to that. The hips (we're wearing a body suit now) have 2, because they share one with the torso. The legs are like the arms, except the toes have 2 DOF each (5 toes x 2 = 10, plus ankle 3, knee 2, and hip 3). So a leg has 18 DOF.

The head/neck has 3 DOF (at least). We will leave the mouth alone, since it is used for talking, and the eyes alone since they are used for seeing, which we want to be independent of output. The forehead and cheeks have 3 DOF (at least).

A fully instrumented human has 23 + 23 + 3 + 2 + 18 + 18 + 6 = 93 degrees of freedom. Position can be encoded with varying accuracy depending on the joint. All positions can be mapped to a number space just like images are mapped to a pixel space. Motion, or change in position, corresponds to animation of an image. If we assume 24 bits per degree of freedom, a complete position of the human body takes 93 x 24 = 2,232 bits, which can be described by an image that is 10 x 10 pixels in size at 24 bits per pixel. The eye, however, can process color imagery 1400 x 1400 pixels in size. Thus the visual bandwidth of a person exceeds the gestural bandwidth by a factor of roughly 20,000 (1400^2 / 10^2 = 19,600). We collect information 20,000 times more effectively than we express it, if only gestures are considered.
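The gesture-versus-vision comparison in code (the DOF inventory and the 24 bits per joint are the assumptions tallied above):

    dof = 23 + 23 + 3 + 2 + 18 + 18 + 6       # arms, head/neck, hips, legs, torso/face
    pose_bits = dof * 24                      # 2,232 bits per complete body position

    gesture_pixels = 10 * 10                  # a 10 x 10 image at 24 bits/pixel
    visual_pixels = 1400 * 1400               # color imagery the eye can process

    print(pose_bits <= gesture_pixels * 24)   # True: a pose fits in the small image
    print(visual_pixels / gesture_pixels)     # 19,600 -> a factor of ~20,000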

What is the bandwidth of speech compared to gesture? There are a few comparisons to make. One can compare the bandwidth of American Sign Language (AMSLAN) to speech: speech is slightly faster, and experienced signers can almost keep up with careful speakers. Speech proceeds at about 180 words per minute; typewritten gesture runs at about a third of that figure.
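A rough bit-rate comparison (hypothetical encoding: five characters per word, eight bits per character):

    for label, wpm in (("speech", 180), ("typing", 60)):
        bps = wpm * 5 * 8 / 60                # words/min x chars/word x bits/char
        print(label, f"{bps:.0f} bits/sec")   # speech ~120, typing ~40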

Thus the rate at which we can take information in exceeds the rate at which we can express ourselves by several orders of magnitude. This is important, and it shows the prejudice imposed on us by the current generation of input devices. Perhaps the fastest way to help us communicate is to enable us to use gestures to make movies that express how we feel and what we want to communicate.