Introduction: This short note makes some calculations regarding the bandwidth of the eye, viewed as an input device. This is followed by a consideration of the body as a gestural output device. Information acquisition, information delivery. Less formally:
 
How Many Pixels Does an Eyeball Have?

How many vision receptors are there? What is the spatial and temporal resolution of the eye? At what rate can the body express information to the outside world?
First Guess: The Eye Has 90 Million "Pixels"

The figure above implies that a reasonable value for optical receptor density is 100,000 receptors per square millimeter. Retinal coverage is nearly 180 degrees. To compute the receptor count we need the diameter of the eye. The radius of the average eye is 12 mm. This implies that the retinal surface area is 2πr², or 904.8 square millimeters. This first-order calculation implies that the average eye contains 90 million receptors. Most people have two. Eyes, that is. When building display devices, it is convenient to give pixels Cartesian (i, j) or (x, y) coordinates. On this basis, if we were to build a display device that could simultaneously excite all the optical receptors, and if this display device were a square, it would have 9,512 pixels on an edge.
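The arithmetic above can be checked with a short script; the receptor density, eye radius, and hemispherical-retina assumption are the ones stated in the text:

```python
import math

density = 100_000    # receptors per square millimeter (from the text)
radius_mm = 12.0     # radius of the average eye, in mm

# The retina covers roughly a hemisphere, so its area is 2*pi*r^2.
area_mm2 = 2 * math.pi * radius_mm ** 2
receptors = density * area_mm2

# Edge length of a square display with one pixel per receptor.
edge = math.sqrt(receptors)

print(round(area_mm2, 1))       # 904.8 square millimeters
print(round(receptors / 1e6))   # 90 million receptors
print(round(edge))              # 9512 pixels on an edge
```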
A Better Approximation: The Eye Has 126 Million "Pixels"

According to Dr. John Penn of the UAMS eye center, the adult retina has 126 million receptors. He points out that not all of these are activated under all lighting conditions, to wit: "as light environment increases in luminance, rod response becomes saturated long before cones are maximally functional."

Washington neuroscience agrees with Dr. Penn. According to this source there are 120 million rods and 6 million cones. Using the figure of 126 million "pixels" or receptors, a display device that met or exceeded the performance of a fixed, staring eye would have 11,225 pixels on an edge.
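The revised edge figure follows directly from the 126 million receptor count:

```python
import math

receptors = 126_000_000          # adult retina, per Dr. Penn
edge = math.sqrt(receptors)      # square display, one pixel per receptor
print(round(edge))               # 11225 pixels on an edge
```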
Only Cones See Color

An important detail in this pursuit is to notice that only the cone cells enable the perception of color. There are three kinds of cones, called the "red", "green", and "blue", designated L, M, and S cones respectively. Rods have a peak sensitivity in the green region of the spectrum, at 500 nm. Perhaps that is why night-vision goggles use green as their luminance display color. The cones, though concentrated in the central (foveal) region of the eye, are also distributed throughout the retina, but in low concentration relative to rods. Rods take more time to acquire signal than do cones, up to 1/10 of a second. Thus it is the peripheral cones that contribute to motion sensing, not the rods! If you don't believe this, try playing tennis at twilight. Your night vision won't do the motion-processing job. In exchange for their slowness, the rods contribute extraordinary luminance sensitivity, down to a single photon.

There are 6 million cones. Let us assume that there is an equal distribution of L, M, and S cones. Now define a {red, green, blue} triple of cones as equivalent to a single pixel. This yields two million "color" receptors that our display device must service, for a fixed and staring eye. This corresponds to a square with 1,414 pixels on an edge.
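Grouping the cones into {L, M, S} triples, the color-pixel count works out as follows:

```python
import math

cones = 6_000_000
color_pixels = cones // 3        # one {L, M, S} triple per "color" pixel
edge = math.sqrt(color_pixels)

print(color_pixels)              # 2,000,000 color receptors
print(round(edge))               # 1414 pixels on an edge
```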
There Is No Fixed and Staring Eye

The only fixed and staring eye is on a dead person! The living eye is constantly in motion; it must be, or the image will wash out. Books, television, movies, and computer displays all require the eye to scan to new content rapidly, both spatially and temporally. These media are all fixed with respect to the coordinate system of an unmoving head. We can now refine our display device calculation based on this. If there were such a thing as a fixed and staring eye, then we could simply build a 1414 x 1414 color display and that would keep the fovea busy. To occupy the rods, this display would have to sit in the center of a much larger 11225 x 11225 pixel display. This outer display would be gray scale only - or "green scale" as the case might be. The inner display would have two percent of the area and carry five percent of the data of the outer.
But this is only for a fixed, staring eye. If the eye moves to an adjacent point on the screen, it would be necessary for that part of the screen to become color and the rest to revert to gray. Reliably tracking random eye movements and displaying the corresponding color/gray image is a formidable task. Even if this could be done, it would not account for head movement. Another alternative would be to put the display on or near the eye itself.
Spatial Resolution of Wearable and Static Displays

For a wearable computer, the above arguments tell us that a 1414 x 1414 resolution foveal display at the six o'clock position of standard eyeglasses should be sufficient. One could "draw the curtains" and go into work mode by supplementing it with the high-bandwidth peripheral display. For safety reasons, while walking on the sidewalk only the foveal display would remain activated. A tabletop, wall, television, movie, or entertainment display that allows for eye movement, head movement, and peripheral vision should accommodate the full resolution of the eye, in color! We also want to acknowledge the architectural simplicity that results from choosing an edge dimension for the image that is a power of two. For static displays, the nearest power of two that meets or exceeds the 11,225 value is 2¹⁴, or 16,384 pixels on an edge. If we choose that figure as the one that meets visual system performance, then each image contains 268 million pixels. Now we can move our eyes and turn our heads just like real people.
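The power-of-two choice and the resulting pixel count check out:

```python
edge = 2 ** 14                 # nearest power of two above 11,225
print(edge)                    # 16384 pixels on an edge
print(edge >= 11225)           # True: covers the full receptor count
print(edge ** 2)               # 268,435,456 pixels per image
```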
Color Resolution

It is known that eight bits of intensity are unsatisfactory for color reproduction, particularly in the low blues, and that at least 12 bits are preferable. On computer architectures that align by the byte, a better size choice is 16 bits per color or alpha channel. We have now defined spatial and color resolutions for static frames that meet computer architecture requirements and meet or exceed visual system performance. We have ignored the issue that most display devices cannot reproduce the dynamic range of intensities found in nature; however, 16 bits of intensity bring us closer.
If color pixels consist of four sixteen-bit words, our uncompressed image requires 2.15 gigabytes of storage. Armed with this design information, let us now define an idealized motion picture and transmission capability that would also be designed to the limits of the human visual system.
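The per-frame storage figure follows from the pixel count and the four 16-bit channels:

```python
pixels = 268_435_456            # a 16384 x 16384 image
bits_per_pixel = 4 * 16         # four 16-bit words (e.g. R, G, B, alpha)

frame_bytes = pixels * bits_per_pixel // 8
print(frame_bytes)                   # 2,147,483,648 bytes
print(round(frame_bytes / 1e9, 2))   # 2.15 gigabytes per frame
```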
Temporal Resolution

It is known that many people are able to discern flashing at rates exceeding 60 Hz. Fluorescent lights are a good example of this. I can see the corners of this monitor flashing at 72 Hz. It appears that cone vision is responsible for this temporal sensitivity. If we were to make a choice that met or exceeded the performance of the human visual system, and was also a power of two, the nearest choice would be 128 Hz. There is reason to believe that trained fighter pilots, baseball players, and table tennis players make motion decisions based on these kinds of rates, so this is entirely appropriate.
Content Channel Resolution

Having established that the user has two eyes and needs to see imagery at 128 Hz, with 268 million pixels per image, we now need to define the number of channels of content that might be made available to the user. Conventional cable systems have from 40 to over 100 channels. Let us assume that 128 channels is sufficient to meet the needs of most content consumers at a level of quality that meets or exceeds the capabilities of the visual system.
We are not including the audio or haptic requirements of a sensory-input-limited system, but a similar calculation could be performed that utilized the touch receptor density and audio frequency range. Haptic and audio bandwidth requirements are not nearly as severe as visual system demands.
A Calculation

Two Eyes x Channels x Pixels/Frame x Bits/Pixel x Frames/Sec:

    2  x  128  x  268,435,456  x  64  x  128  ≈  6 x 10¹⁴ bits/sec
    2¹ x  2⁷   x  2²⁸          x  2⁶  x  2⁷   =  2⁴⁹ bits/sec
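Both forms of the product agree, as a quick check confirms:

```python
eyes, channels, pixels, bits, fps = 2, 128, 268_435_456, 64, 128

total = eyes * channels * pixels * bits * fps
print(total == 2 ** 49)   # True: 2^1 * 2^7 * 2^28 * 2^6 * 2^7
print(f"{total:.1e}")     # 5.6e+14, i.e. about 6 x 10^14 bits/sec
```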
So a system that met the performance requirements of the human visual system and satisfied the variety requirement of an intelligent user would require delivery devices and networks whose bandwidth is at least 6 x 10¹⁴ bits per second. Such systems would have to operate at a carrier frequency of at least 600 terahertz, that is, at a wavelength shorter than 500 nanometers, which is in the range of green light. To store a two-hour movie at an appropriately sampled rate would require
2 x 1 x 268,435,456 x 64 x 128 x 3600 x 2 / 8 ≈ 4 petabytes (= 4,000 terabytes, where a terabyte is 1,000 gigabytes)
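Evaluating that product (two eyes, one channel, two hours at 128 frames per second) confirms the figure:

```python
eyes, pixels, bits, fps = 2, 268_435_456, 64, 128
seconds = 3600 * 2                # a two-hour movie

movie_bytes = eyes * pixels * bits * fps * seconds // 8
print(round(movie_bytes / 1e15))  # ~4 petabytes
```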
This is a little more than a video tape currently holds!
Simplified Calculation for Monocular Displays
Traditional monocular displays include CRT monitors, liquid crystal displays, and projection screens that do not exploit polarized light. If we revise our estimates for current display technology and ask the question, "How much data can we deliver to the eye from a traditional display?", we have:

    1 [Eye] x 1 [Channel] x 1024 x 768 [Pixels/Frame] x 4 [Bytes/Pixel] x 24 [Frames/Sec] ≈ 75 [Megabytes/Sec]
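The conventional-display figure evaluates as:

```python
bytes_per_sec = 1 * 1 * 1024 * 768 * 4 * 24   # eye, channel, pixels, depth, rate
print(bytes_per_sec)                 # 75,497,472 bytes per second
print(round(bytes_per_sec / 1e6))    # ~75 megabytes per second
```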
Gestural Bandwidth vs. Visual Bandwidth
We might compare this input bandwidth of the human being to the output bandwidth. Output bandwidth, or the rate at which we can express ourselves, includes all vocalization and movement that could be digitally captured. This is a somewhat more difficult calculation, but we can create an upper bound based on the number of joint degrees of freedom, the number of distinct positions, and the rate at which those positions could change.
Fingers have 3 degrees of freedom; 5 fingers give 15 degrees of freedom (DOF). The fingers connect to the wrist for 3 more degrees of freedom, the wrist to the elbow for 2 more, and the elbow to the shoulder for 3 more. So an arm has 23 DOF.

The torso, like the shoulder, has 3 gross DOF; dancers can add about another 6 fine DOF to that. The hips (we're wearing a body suit now) have 2, because they share one with the torso. The legs are like the arms, but the toes have 2 DOF each. So a leg has 18 DOF.
The head/neck has 3 DOF (at least). We will leave the mouth alone, since it is used for talking, and the eyes alone, since they are used for seeing, which we want to be independent of output. The forehead and cheeks have 3 DOF (at least).
A fully instrumented human has 23 + 23 + 3 + 2 + 18 + 18 + 6 = 93 degrees of freedom. Position can be encoded with varying accuracy depending on the joint. All positions can be mapped to a number space, just as images are mapped to a pixel space. Motion, or change in position, corresponds to animation of an image. If we assume 24 bits per degree of freedom, a complete position of the human body can be described by an image that is 10 x 10 pixels in size. The eye, however, can process color imagery 1400 x 1400 pixels in size. Thus the visual bandwidth of a person exceeds the gestural bandwidth by a factor of roughly 20,000: we collect information 20,000 times more effectively than we express it, if only gestures are considered.
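The DOF total and the 20,000x ratio can be verified directly, summing the terms exactly as listed in the text:

```python
# Terms as listed: two arms, head/neck, hips, two legs, and the
# remaining torso/face degrees of freedom.
terms = [23, 23, 3, 2, 18, 18, 6]
dof = sum(terms)
print(dof)   # 93 degrees of freedom

bits_per_dof = 24
pose_bits = dof * bits_per_dof             # 2232 bits per body position
image_bits = 10 * 10 * bits_per_dof        # a 10 x 10 "image" at 24 bpp
print(pose_bits <= image_bits)             # True: the pose fits

visual_pixels = 1400 * 1400                # foveal color imagery
ratio = visual_pixels / (10 * 10)
print(round(ratio))                        # 19600, i.e. roughly 20,000
```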
What is the bandwidth of speech compared to gesture? There are two places to look, and three comparisons to make. One can compare the bandwidth of American Sign Language (AMSLAN) to speech. Speech is slightly faster; experienced signers can almost keep up with careful speakers. Speech proceeds at about 180 words per minute; typewritten gesture is about a third of this figure.
Thus the rate at which we can take information in exceeds our ability to express ourselves by many orders of magnitude. This is important, and shows the prejudice imposed on us by the current generation of input devices. Perhaps the fastest way to help us communicate is to enable us to use gestures to make movies that express how we feel and what we want to communicate.