Sound Into Graphics

L. Van Warren • Patrick E. Kane • James Bozek
April 6, 1983
Revised December 27, 1995
Converted to HTML February 4, 1998

Abstract

This paper discusses the mapping of sound into graphics for the purpose of generating realistic images of objects that correspond to entities in sound space. The paper is the outline of a plan to generate a short motion picture containing the generated images and sound together. The fast Fourier transform (FFT) maps between sound sample space and the frequency domain. Polygonal models are generated by conversion programs, which use the FFT and viewer position information. These polygonal models are then scan converted for color raster output.

Description of Project

Changes in sound over time cause changes in a scene. The purpose of this project is to map a sound into a scene. The audio and visuals are then played together, so that the sound is seen as well as heard. This is accomplished by generating a polygonal model of the sound for each frame via a suitable mapping from sound into graphics, and rendering the resulting model using scan conversion techniques.

Implications

The long-term goal is to synthesize both ways, i.e. to map sounds into objects, and objects into sounds. One could then hear changes in a scene that were hidden from the observer by other objects. A deaf person could watch music, and a blind person could hear graphics. Those with both senses intact could have a deeper insight into the nature of the signal that the artist is trying to communicate.

Background

This project began in 1982 at the University of Illinois. The ideas were refined during discussions among Jim Bozek, a computer scientist/musician; Patrick Kane, an engineer/graphicist who was experienced in computer music; and myself, an engineering student and occasional musician. One of the problems was choosing a suitable visual model to connect sound and visual spaces. One evening, after viewing the Van Cliburn piano competition, it became apparent that a modified piano key concept embodied the appropriate motion model. A displaced object (not a vibrating one), the piano 'key' produced a sound with a given duration. A convenient inverse map also existed; that is, a sound could be made to produce the displacement of a key, as in the case of a player piano. This became labeled the 'keys' interpretation, and seemed appropriate since it had a root in everyday experience, so that its implications could be understood intuitively by a non-technical audience.

Technique

The fast Fourier transform was used to convert between audio space and visual space. Music of interest was digitized at 40,000 samples per second, and the discrete samples were converted to characteristic frequency values via FFT for each frame of 'action'. At the 24 frames per second of motion picture film, this meant that about 1667 samples (40,000 / 24) contributed to each frame. The characteristic frequency values were then clustered and mapped onto a grid that, in the 8 x 12 case, had a one-to-one correspondence with the keys of a piano.
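The per-frame conversion described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the original C code: a plain discrete Fourier transform stands in for the FFT, the frame rate of 24 is assumed, and the pitch of the lowest key (LOWEST_KEY_HZ) and the semitone-based clustering rule are hypothetical choices. Each frequency bin is assigned to the nearest semitone, and the RMS of the bins in each cluster becomes the key height.

```python
import cmath
import math

SAMPLE_RATE = 40_000    # samples per second (from the text)
FRAME_RATE = 24         # film frames per second (assumed)
ROWS, COLS = 8, 12      # key grid (from the text)
LOWEST_KEY_HZ = 27.5    # hypothetical pitch of key 0; not given in the text


def key_heights(frame):
    """Return ROWS * COLS RMS heights for one frame of samples."""
    n = len(frame)
    # Precompute the n complex roots of unity used by the DFT.
    twiddle = [cmath.exp(-2j * math.pi * j / n) for j in range(n)]
    power_sum = [0.0] * (ROWS * COLS)
    bin_count = [0] * (ROWS * COLS)
    for k in range(1, n // 2):
        freq = k * SAMPLE_RATE / n
        # Cluster: nearest semitone above the lowest key (assumed rule).
        key = round(12 * math.log2(freq / LOWEST_KEY_HZ))
        if not 0 <= key < ROWS * COLS:
            continue  # this frequency falls outside the grid
        coeff = sum(s * twiddle[(k * t) % n] for t, s in enumerate(frame))
        power_sum[key] += (2 * abs(coeff) / n) ** 2
        bin_count[key] += 1
    # Root mean square of the bin amplitudes gathered for each key.
    return [math.sqrt(p / c) if c else 0.0
            for p, c in zip(power_sum, bin_count)]


# One frame of a 440 Hz test tone.
frame_len = round(SAMPLE_RATE / FRAME_RATE)
tone = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(frame_len)]
heights = key_heights(tone)
loudest = max(range(len(heights)), key=heights.__getitem__)
row, col = divmod(loudest, COLS)
print(frame_len, loudest, row, col)
```

With a frame this short the bin spacing is about 24 Hz, so in this sketch the lowest keys receive few or no bins; how the original programs handled that range is not stated.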

Execution

Pat Kane and Jim Bozek did the FFTs in Illinois using the IEEE signal processing package and utility routines that they wrote in the 'C' language. Pat Kane also produced single frames of bicubic patches using the Raster Test-Bed (RTB) by Turner Whitted and David Weimer. The author produced single frames of the "Piano Keys" polygon model using RTB and, subsequently, special-purpose rendering software. The University of Illinois CSO VAX 11/780 was used for the FFTs and for work done with bicubic patches. The Utah graphics VAX 11/750 was used for rendering.

Initial Goals

Five Second Leader

The first objective was a five-second test to establish feasibility. The project represented the connection of several software tools in two different locations. Because of this, it was necessary to verify the correct functioning of several independent processes. The test leader was intended to reveal those problems that were resolvable and those that were not.

Sixty Second Film

The second goal was to produce a sixty-second film using the knowledge gained in the running of the five-second leader. The film was to be done in three twenty-second sections, corresponding to music already digitized. The pieces that were digitized included a piano scale, a piano piece, an electric guitar piece, and a Synclavier™ synthesizer piece.

Detailed Description of Mappings

The Keys Interpretation

The piano keys mapping described above consists of a rectangular grid of beveled blocks, arranged in eight rows of 12 blocks each. When no audio signal was applied, the resulting polygonal model resembled an array of rainbow-colored buttons. During play, the model resembled a 'city of skyscrapers'. The height of each key corresponded to the root mean square intensity of that particular note.

The Soap Film Interpretation

Another mapping of interest uses the sound transform to supply the values of a forcing function acting on a soap film that spans a rectangular domain. The surface is generated by solving the boundary value problem posed by Poisson's equation:

$$\nabla^2 u(x, y) = f(x, y)$$

in a rectangular domain, where u is the height of the film and the 'key' heights give the magnitude of the forcing function f at a particular point within the domain. The resulting surface is a 'minimal' surface in the mathematical sense, in that it has the minimum surface area that satisfies the constraints. It is, in effect, a stable global interpolant for the surface generated by the influence of the keys. This version was never done, although Pat Kane did test frames involving spline patches, one for each ‘key’.
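Since the soap film version was never built, the following is only a sketch of how such a surface might be computed, not a record of any project code. It assumes a unit-spaced grid, a film height of zero on the border, and Gauss-Seidel relaxation of the five-point discrete form of Poisson's equation (Laplacian of the height equals the forcing). The function name, the grid size, and the sign convention (a negative forcing value raises the film) are all assumptions made for illustration.

```python
def solve_poisson(forcing, sweeps=500):
    """Relax laplacian(height) = forcing with height = 0 on the border.

    Gauss-Seidel iteration on a unit-spaced grid (an assumed method).
    """
    rows, cols = len(forcing), len(forcing[0])
    height = [[0.0] * cols for _ in range(rows)]
    for _ in range(sweeps):
        for i in range(1, rows - 1):
            for j in range(1, cols - 1):
                neighbors = (height[i - 1][j] + height[i + 1][j] +
                             height[i][j - 1] + height[i][j + 1])
                height[i][j] = 0.25 * (neighbors - forcing[i][j])
    return height


# An 8 x 12 block of keys surrounded by a one-cell fixed border.
ROWS, COLS = 8 + 2, 12 + 2
forcing = [[0.0] * COLS for _ in range(ROWS)]
forcing[4][6] = -1.0   # one sounding key; negative forcing lifts the film
film = solve_poisson(forcing)
peak = max(max(row) for row in film)
print(round(film[4][6], 4), round(peak, 4))
```

The film rises highest at the sounding key and falls smoothly to zero at the border, which is the interpolating behavior the text describes.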

Software

The movie frames were generated using UNIX™ shell scripts. The ith frame of the 'keys' animation was produced using a statement like this:

keyconv fft.i | apply pos.i | fb_clip | scnv | dd of=/dev/rmt9

where:

keyconv: generates the polygon model for the ith frame.
fft.i: is the ith array of key altitudes.
apply: applies the 4 x 4 transformation in file pos.i to object.
fb_clip: clips the polygons to frame buffer dimensions.
scnv: scan converts the resulting polygon model.
dd: writes the resulting picture directly to magnetic tape.

The position files, which also include corrections for pixel aspect ratio, contained the representative transformations:

ident | obj_pos x_i y_i z_i rx_i ry_i rz_i | perspec e | fb_aspect > pos.i

where:

ident: generates a 4 x 4 identity matrix.
obj_pos: does the translation and rotation of the current 4 x 4.
perspec: does the perspective transformation.
fb_aspect: corrects for the nonsquare pixels on the output device.
pos.i: is the resulting 4 x 4 transformation matrix.

The location values were computed using a view specification program that interpolated key frame values of these parameters. The view path program took the key frames and enabled the preview of the motion on a line drawing display using an iconic cube to represent the orientation of the keys platform with respect to the viewer. Correction for aspect ratio went as follows:

• picture tube was 3 in y by 4 in x
• pixels were 15 in x by 16 in y
So aspect ratio was 3/4 x 15/16 = 45/64
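The position-file pipeline (ident | obj_pos | perspec | fb_aspect) and the aspect arithmetic can be sketched together as below. The function names mirror the programs named in the text, but their bodies are assumptions: the matrix conventions (column vectors, rotation about z only, a perspective divide by z / e, and scaling of x by the aspect ratio) are illustrative choices, not the originals.

```python
import math
from fractions import Fraction

# From the text: a 3:4 picture tube with 15 x 16 pixels gives 45/64.
ASPECT = Fraction(3, 4) * Fraction(15, 16)


def ident():
    """Return a 4 x 4 identity matrix."""
    return [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]


def matmul(a, b):
    """Multiply two 4 x 4 matrices."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)]
            for r in range(4)]


def obj_pos(m, x, y, z, rz_degrees):
    """Rotate about z, then translate (rx and ry omitted for brevity)."""
    c = math.cos(math.radians(rz_degrees))
    s = math.sin(math.radians(rz_degrees))
    rot = ident()
    rot[0][0], rot[0][1], rot[1][0], rot[1][1] = c, -s, s, c
    trans = ident()
    trans[0][3], trans[1][3], trans[2][3] = x, y, z
    return matmul(trans, matmul(rot, m))


def perspec(m, e):
    """Set w = z / e for an eye at distance e (assumed form)."""
    p = ident()
    p[3][2], p[3][3] = 1.0 / e, 0.0
    return matmul(p, m)


def fb_aspect(m, ratio):
    """Scale x by the aspect ratio (the axis choice is an assumption)."""
    s = ident()
    s[0][0] = float(ratio)
    return matmul(s, m)


def apply(m, point):
    """Transform a 3D point and perform the homogeneous divide."""
    v = [point[0], point[1], point[2], 1.0]
    out = [sum(m[r][k] * v[k] for k in range(4)) for r in range(4)]
    return [out[i] / out[3] for i in range(3)]


pos = fb_aspect(perspec(obj_pos(ident(), 0.0, 0.0, 8.0, 0.0), 4.0), ASPECT)
screen = apply(pos, (2.0, 4.0, 0.0))
print(ASPECT, screen)
```

This simplified perspective matrix does not preserve a usable depth value after the divide; a renderer with a z buffer would need a form that does.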

Programming Timeline

PROGRAMMING TASK TIMELINE

TASK      | PURPOSE                | PROGRAM     | TIME
----------+------------------------+-------------+--------
clean up  | scan conversion        | scnv.c      | 2 days
implement | anti aliasing          | scnv.c      | 4 days
rewrite   | polygon generation     | keyconv.c   | 4 days
write     | object transformation  | orient.c    | 2 days
write     | object clipping        | fb_clip.c   | 1 day
write     | matrix generation      | ident.c     | 1 day
write     | translation/rotation   | obj_pos.c   | 1 day
write     | perspec transformation | perspec.c   | 1 day
write     | aspect correction      | fb_aspect.c | 1 day
write     | shell script           | movie       | 1 day

Production Timeline

Five Second Leader Production Task Timeline

TASK             | TIME
-----------------+---------
frame generation | 1 week
frame recording  | 1 week
film development | ??
film evaluation  | 1 day
sound transfer   | 2 weeks

SECTION II: PROJECT RESULTS

Important Changes in Approach - 1983

It was found that parsing the polygon model using yacc was slow. For the 'city of keys', which contained approximately 4500 polygons, it required two hours to parse and render the description file, even after the polygon description grammar had been compacted. Using the grammar as a way of specifying the model allowed a great deal of expressive power, as it was much easier to produce and verify a text description of the model than a binary description. To avoid the parsing overhead, debugged versions of the model generation program, the scan conversion program, and the object transformation program were combined into one 'mega' program. This was done to effect quick conversion of the input FFT and transformation matrix into output rendered images. Times for the mega version varied from 1.5 to 4.5 minutes per frame, a much more reasonable time for animation.

Update - 1995

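When the same ‘mega’ program was ported in its original condition to a PowerPC Macintosh 8500/120, the frame time dropped to 1.6 seconds. Measured against the fastest 1983 frame time of 1.5 minutes, this is roughly a 60-fold speedup in 12 years (90 / 1.6 ≈ 56); against the slowest frames, it approaches 170-fold.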

Shot List - 1983

Shot List - film for SIGGRAPH '83

'*' indicates photography completed

-------------+----------------------------+------------------------------------
DURATION     | PICTURE                    | SOUND SHELL SCRIPT
-------------+----------------------------+------------------------------------
38 1/6 sec   | HARPSI. TIME SUBTOTAL 840 fr. rndr + 1 fr. util + 916 fr. mtrx
16 sec       | CREDITS TIME SUBTOTAL 0 fr. rndr + 5 fr. util + 384 fr. mtrx
-------------+----------------------------+------------------------------------

GRAND TOTALS

-------------+----------------------+-------------+------------+-------------
SCREEN TIME  | MOVIE SECTION        | RENDER FRMS | UTIL FRMS  | MATRIX FRMS
-------------+----------------------+-------------+------------+-------------
9 sec        | LEADER               |           0 |        169 |         216
8 2/6 sec    | TITLE                |           1 |          2 |         200
33 1/2 sec   | SCALE                |         600 |          2 |         804
38 1/6 sec   | HARPSI.              |         840 |          1 |         916
16 sec       | CREDITS              |           0 |          5 |         384
-------------+----------------------+-------------+------------+-------------
105 sec      |                      |        1441 |        179 |        2520
-------------+----------------------+-------------+------------+-------------

RENDER CPU TIME: (1441 fr.) x (180 sec/fr.) / (3600 sec/hr.) = 72.1 cpu hours

CAMERA TIME: (2520 fr.) x (30 sec/fr.) / (3600 sec/hr.) = 21.0 con hours

At 40 frames per foot, the 2520 frames work out to 63.0 feet of 16mm color negative film.
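The production totals above can be rechecked with a few lines of arithmetic. The sketch below takes the per-section frame counts and the per-frame times from the tables and formulas in this section; the frame rate of 24 and the 40 frames per foot of 16mm film are assumptions consistent with the stated totals, and the variable names are invented for illustration.

```python
FRAME_RATE = 24             # frames per second (assumed; 2520 fr. / 105 sec)
FRAMES_PER_FOOT = 40        # 16mm film (assumed; gives the stated 63 feet)
RENDER_SEC_PER_FRAME = 180  # from the render CPU time formula
CAMERA_SEC_PER_FRAME = 30   # from the camera time formula

# (section, render frames, utility frames, matrix frames) from the totals.
SECTIONS = [
    ("LEADER", 0, 169, 216),
    ("TITLE", 1, 2, 200),
    ("SCALE", 600, 2, 804),
    ("HARPSI.", 840, 1, 916),
    ("CREDITS", 0, 5, 384),
]

render_frames = sum(s[1] for s in SECTIONS)
util_frames = sum(s[2] for s in SECTIONS)
matrix_frames = sum(s[3] for s in SECTIONS)

render_cpu_hours = render_frames * RENDER_SEC_PER_FRAME / 3600
camera_hours = matrix_frames * CAMERA_SEC_PER_FRAME / 3600
screen_seconds = matrix_frames / FRAME_RATE
film_feet = matrix_frames / FRAMES_PER_FOOT

print(render_frames, util_frames, matrix_frames)
print(render_cpu_hours, camera_hours, screen_seconds, film_feet)
```

The matrix-frame count is used here as the number of frames exposed, which is how the camera time formula treats it.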

Important Changes in Approach - Motion Control

After the first footage was processed, it was learned that the most pressing problem was the unnatural suddenness of the motion, an artifact of linear interpolation of the key frames. This was fixed by interpolating the key frames using cubic splines.
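The text does not say which cubic spline was used, so the sketch below substitutes a Catmull-Rom spline, a common choice that passes through every key frame. It interpolates a single view parameter; the function names and the end-point handling (repeating the first and last keys) are assumptions.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Cubic Catmull-Rom value between p1 (t = 0) and p2 (t = 1)."""
    return 0.5 * (2 * p1
                  + (p2 - p0) * t
                  + (2 * p0 - 5 * p1 + 4 * p2 - p3) * t * t
                  + (3 * p1 - p0 - 3 * p2 + p3) * t ** 3)


def interpolate(keys, steps):
    """Return in-between values with `steps` frames per key interval."""
    padded = [keys[0]] + list(keys) + [keys[-1]]  # repeat the end keys
    path = []
    for i in range(len(keys) - 1):
        p0, p1, p2, p3 = padded[i:i + 4]
        path.extend(catmull_rom(p0, p1, p2, p3, s / steps)
                    for s in range(steps))
    path.append(keys[-1])
    return path


# One parameter (for example a rotation angle) at three key frames.
path = interpolate([0.0, 10.0, 0.0], steps=4)
print(path)
```

Unlike linear interpolation, the in-between values ease through each key frame instead of changing direction abruptly there.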

Surprises

The application of perspective was vital to a correct look to the 'city of keys'. This was most likely due to the presence of a large number of rectilinear features whose interpretation was enhanced by the transformation.

Other Tools Developed

A tool for specifying a standard motion picture academy leader was developed and subsequently improved (see plates). A swept-time analog frame counter, a center number, and crosshairs were placed in a rainbow boundary against a fractal background created especially for the project by Todd Fuqua.

Footage Processed

At present, five seconds of the 'city of keys' has been rendered and transferred to 16mm film. The results have been encouraging. The 'City of Keys' version was rendered using scan conversion software written by the author, which rendered one polygon at a time into a full-screen z buffer for hidden surface elimination. A Phong lighting model was used, with the assumption of an infinite light source. A finite light source version, in which the light vector and the eye vector were recomputed for each visible point on the surface, was also tried, but the rendering times proved too long for practical animation.
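The two rendering ideas in this paragraph can be sketched in a few lines. This is a minimal illustration, not the author's scan converter: the light and eye directions are constants (the infinite-source assumption), the reflection coefficients are invented values, and the depth convention (smaller z is nearer) is an assumption.

```python
import math


def normalize(v):
    """Scale a 3-vector to unit length."""
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def phong(normal, light, eye, ka=0.1, kd=0.6, ks=0.3, shininess=20):
    """Phong intensity for constant light and eye directions."""
    n, l, v = normalize(normal), normalize(light), normalize(eye)
    n_dot_l = dot(n, l)
    if n_dot_l <= 0.0:
        return ka  # surface faces away from the light: ambient only
    # Mirror reflection of the light direction about the normal.
    r = tuple(2 * n_dot_l * nc - lc for nc, lc in zip(n, l))
    return ka + kd * n_dot_l + ks * max(dot(r, v), 0.0) ** shininess


class ZBuffer:
    """Full-screen depth buffer: the nearest sample at each pixel wins."""

    def __init__(self, width, height):
        self.depth = [[math.inf] * width for _ in range(height)]
        self.shade = [[0.0] * width for _ in range(height)]

    def plot(self, x, y, z, shade):
        if z < self.depth[y][x]:   # smaller z is nearer (assumed)
            self.depth[y][x] = z
            self.shade[y][x] = shade


LIGHT = EYE = (0.0, 0.0, 1.0)
side = phong((1.0, 0.0, 0.0), LIGHT, EYE)   # face turned away from the light
top = phong((0.0, 0.0, 1.0), LIGHT, EYE)    # face turned toward the light

fb = ZBuffer(4, 4)
fb.plot(1, 1, 5.0, side)   # far polygon drawn first
fb.plot(1, 1, 2.0, top)    # nearer polygon replaces it
fb.plot(1, 1, 9.0, side)   # farther polygon is rejected
print(side, top, fb.shade[1][1])
```

In the finite light source variant described in the text, the light and eye vectors would be recomputed from the surface point for every sample, which is what made that version so much slower.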

Acknowledgments

Several individuals have provided assistance of various essential sorts, without which this short film could not have been made. Lewis Knapp, Spencer Thomas and Dino Schweitzer provided valuable technical assistance and kind criticism that was very helpful in improving the quality of the final product.

Date: 15 Jun 83 22:05 MDT
From: Lewie Knapp <knapp>
Subject: acknowledgement
Message-Id: <8306160406.AA14421@UTAH-GR.ARPA>
To: warren

_________________________________

>From RIESENFELD@UTAH-20 Wed Jun 15 17:36:20 1983
>Date: 15 Jun 1983 1730-MDT
From: RIESENFELD@UTAH-20 (Rich Riesenfeld)
Subject: ACK
To: knapp@UTAH-20, knapp@UTAH-GR

Title of Work*
* This work was supported in part by the National Science Foundation
(MCS-8203692 and MCS-8121750) and the U.S. Army Research Office
(DAAG29-81-K-0111 and DAAG29-82-K-0176) and the Office of Naval Research
(N00014-82-K-0351).

-------

Looks like a mouthful. Small print, I guess.

I probably (definitely) won't make it in tonight.

Give me a call if you need anything.

1995 Update

The preceding part of this document was written largely in 1983, with some later corrections and additions. During the computation of the movie, there was some conflict regarding just how much CPU time was being occupied by the project. This was barely resolved in time to enable the project to continue. The film was completed in time for the 1983 SIGGRAPH computer graphics conference, but the referees would not accept material that arrived on the day of the conference. It was refereed formally in winter 1984 and shown at the summer 1984 SIGGRAPH conference, where it was met with the cheers of a technical audience exceeding 8000. For some reason, it was not included in the conference summary video, so no historical record of this work exists except for the 16mm original.

1995 Additions

This document is the collection point for important details regarding the making of the original film and the adaptation of its techniques to modern processing platforms. For convenience, these additions will be appended to this document.