Ramani Duraiswami - ramani@umiacs.umd.edu
Richard O. Duda, V. Ralph Algazi, Larry Davis, Nail Gumerov, Qing-Huo Liu, Shihab Shamma, Howard Elman, Rama Chellappa, Yiannis Aloimonos, S.T. Raveendra
Institute for Advanced Computer Studies, University of Maryland,
College Park, MD, 20742
Popular version of paper 4aPP11
Presented Thursday morning, 11:15 a.m., December 7, 2000
ASA/NOISE-CON 2000 Meeting, Newport Beach, CA
Work supported by NSF.
There are many scientific, commercial and entertainment applications for 3-D or spatial sound. An ideal virtual spatial audio system would produce the illusion of hearing sounds as if you were actually present in the room. The 3-D PC-soundcards and the home theater systems that are now available are able to place sounds far to the left, right, and even behind the listener. However, accurate and controllable placement of sounds in all three dimensions -- left and right, up and down, near and far -- is beyond the ability of current technology. All three dimensions must be controlled to produce the virtual audio needed for virtual reality. Our research is directed at making this possible.
Hearing scientists have shown that - in principle, at least - it is
possible to make sounds appear to come from any desired
location. Furthermore, this can be done using only two signals - the
sound reaching each of the ears. By properly controlling the sounds
sent through headphones to the left and the right ears, the experience
of being in a 3-D sound space can be reproduced. By using clever
digital signal processing techniques, audio engineers have shown that
the same effects can be produced using only two
loudspeakers.
The secret to creating these
effects emerged from careful study of the cues that humans use to
locate a sound source.
The most familiar cue is the so-called interaural time
difference, the difference in the times at which the sound waves
coming from a source reach our two ears. However, the interaural time difference is by no
means the only cue.
Although it accounts for much of our left/right perception, it
does not explain our up/down or our near/far
perception.
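To make the interaural time difference concrete, here is a minimal Python sketch. It uses Woodworth's classical spherical-head approximation, which is a textbook formula and not necessarily the model used in our research. The head radius, the speed of sound, and the function name are all illustrative assumptions.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in room-temperature air (assumed value)
HEAD_RADIUS = 0.0875    # m, a commonly quoted average (assumed value)


def itd_seconds(azimuth_deg, head_radius=HEAD_RADIUS, c=SPEED_OF_SOUND):
    """Approximate interaural time difference for a distant source.

    azimuth_deg is measured in the horizontal plane:
    0 = straight ahead, 90 = directly to the right, 180 = directly behind.
    A positive result means the sound reaches the right ear first.
    """
    azimuth = math.radians(azimuth_deg)
    # Only the left/right (lateral) angle matters in this simple model,
    # so a source behind the listener is folded onto its front mirror image.
    lateral = math.asin(math.sin(azimuth))
    # Woodworth's spherical-head approximation.
    return (head_radius / c) * (lateral + math.sin(lateral))


for azimuth in (0, 30, 90, 150):
    print(f"azimuth {azimuth:3d} deg -> ITD {itd_seconds(azimuth) * 1e6:6.1f} microseconds")
```

In this simplified model a source at 30 degrees (front right) and one at 150 degrees (rear right) produce the same time difference, which illustrates why this cue alone cannot separate front from back, or up from down.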
It turns out that we use not only
the sound traveling directly from the source to our ear canals, but
also the sound that reaches us via other more indirect paths, after
being scattered off our external ears, heads, and bodies, as well as
walls, floors, and other surfaces in the surrounding environment. It is this scattering process
that endows the received waves with cues that the brain deciphers and
processes to locate the source.
Fig. 1: Sound from a source that reaches
our ears includes both the sound that arrives along a direct path and
sound that is scattered by the environment and our external ears and
body. This scattering process amplifies or attenuates different
frequency components, producing cues that enable our brains to locate the
source.
When a sound wave is scattered off
an object, its behavior is governed by the ratio of the object size
to the wavelength of the sound. When the object size is much larger
than the wavelength, the sound bounces off like a ray of light hitting
a mirror. However, when the object size and the wavelength of the
sound wave are comparable, the scattered wave is much more
complex. Furthermore, the various components of the sound that reach
the ear via different paths interact with one another. For ordinary
sounds that contain many different frequencies, the result is a change
in the balance between different frequencies that we are
subconsciously able to attribute to the location of the source. The process by which the brain
does this localization is a subject of intense research by
neuropsychologists.
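The size-to-wavelength ratio is easy to estimate, since the wavelength is the speed of sound divided by the frequency. The short Python sketch below computes it for a few frequencies. The body-part sizes and the speed of sound are rough, assumed values chosen only for illustration.

```python
SPEED_OF_SOUND = 343.0  # m/s in room-temperature air (assumed value)


def wavelength_m(frequency_hz, c=SPEED_OF_SOUND):
    """Wavelength in meters of a sound wave of the given frequency."""
    return c / frequency_hz


def size_to_wavelength_ratio(size_m, frequency_hz, c=SPEED_OF_SOUND):
    """Ratio of an object's size to the wavelength of the sound."""
    return size_m / wavelength_m(frequency_hz, c)


# Rough, illustrative sizes; real listeners differ considerably.
ASSUMED_SIZES_M = {"head": 0.18, "outer ear": 0.06}

for frequency_hz in (200, 2000, 8000):
    for name, size_m in ASSUMED_SIZES_M.items():
        ratio = size_to_wavelength_ratio(size_m, frequency_hz)
        print(f"{frequency_hz:5d} Hz, {name:9s}: size / wavelength = {ratio:.2f}")
```

A ratio near 1 marks the regime in which the scattered wave is complex. With these assumed sizes, that happens in the low kilohertz range for the head and at higher frequencies for the outer ear.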
The function that encodes the
relative amplification or attenuation of the sound at a particular
frequency is called the "Head-Related Transfer Function," and is often
abbreviated as the HRTF.
The secret to rendering virtual audio accurately is to obtain
the HRTF accurately.
However, there is a major complication. HRTFs are different
for different people. Because we all have different sized and shaped
ears, heads and bodies, we all have different HRTFs. Just as we need our own
customized eyeglasses to see properly, we need our own customized
HRTFs to hear spatial sound properly. Failure to account for individual differences leads
to problems such as elevation errors and high rates of front/back
confusion.
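As a rough illustration of how an HRTF is used once it is known, the Python sketch below filters a single-channel sound through a pair of impulse responses, one per ear. An impulse response is the time-domain counterpart of a transfer function. The numbers are invented toy values, not measured data, and the function names are hypothetical.

```python
def convolve(signal, impulse_response):
    """Direct-form convolution of two sequences of samples."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


def render_binaural(mono, hrir_left, hrir_right):
    """Filter one mono signal into separate left-ear and right-ear signals."""
    return convolve(mono, hrir_left), convolve(mono, hrir_right)


# Toy impulse responses for a source on the listener's right (invented values):
# the right ear hears the sound earlier and louder than the left ear.
TOY_HRIR_LEFT = [0.0, 0.0, 0.6, 0.2]
TOY_HRIR_RIGHT = [1.0, 0.3]

click = [1.0, 0.0, 0.0]  # a single click as the test sound
left, right = render_binaural(click, TOY_HRIR_LEFT, TOY_HRIR_RIGHT)
print("left ear: ", left)
print("right ear:", right)
```

A real system would use one measured or computed pair of responses for each source direction, and, as noted above, a different set for each listener.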
Furthermore, for a true perception
of a localized source, the cues must change with the motion of the
listener. If they do not change properly, the listener can become
confused, and can even experience the sound as coming from within his
or her head. Thus, the
HRTF must not only be customized to the individual listener, but it
must also change correctly when the listener
moves.
Our research is directed at finding
effective ways to solve these two key problems: (a) quick and accurate
determination of individual HRTFs, and (b) quick and accurate ways to
modify the HRTFs dynamically in accordance with a listener's
movements, including changes that arise from changes in the listener's
posture. In both cases, we want to take advantage of advances in
computer vision research and technology to solve these
problems.
In recent years, it has become
possible to use computers to determine many physical properties of
objects from digital video, using developments in powerful computer
vision methods. To measure individual HRTFs, we will use computer
vision techniques to obtain accurate 3-D surface models of a person's
torso, head, and ears. We
will then calculate the HRTF by using numerical methods to solve the
basic equations of physics that govern the propagation of sound
waves. We expect that
this approach will be much more rapid, accurate and convenient than
the acoustic methods currently used to measure HRTFs. In spite of the massive
computation required, advances in high-performance computing now make
such an approach possible. Additionally, part of our research will be
devoted to developing even faster computational
methods.
To modify the HRTFs dynamically, we
will use computer vision techniques to track people as they move, and
to modify the HRTF accordingly.
In our initial work, we will neglect any possible effects of
the limbs and torso, and will focus on the changes that stem solely
from head translations and rotations. However, we will also decompose the HRTFs into
parts that separately account for the contributions of different parts
of the body. This decomposition will provide an ability to account for
the important effects of postural changes.
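The simplest case, a pure head rotation, can be sketched in a few lines of Python. When the head turns, the direction of a fixed source relative to the head changes by the opposite amount, and the renderer must switch to the HRTF for the new direction. The 30-degree grid of available directions and the function names below are assumptions made for illustration, not a description of our system.

```python
def wrap_degrees(angle):
    """Wrap an angle in degrees into the range [-180, 180)."""
    return (angle + 180.0) % 360.0 - 180.0


def relative_azimuth(source_azimuth_deg, head_yaw_deg):
    """Direction of a fixed source as seen from the rotated head."""
    return wrap_degrees(source_azimuth_deg - head_yaw_deg)


def nearest_available_azimuth(azimuth_deg, available_azimuths):
    """Pick the available HRTF direction closest to the requested one."""
    return min(available_azimuths,
               key=lambda a: abs(wrap_degrees(azimuth_deg - a)))


# Hypothetical set of directions for which HRTFs are available.
AVAILABLE_AZIMUTHS = list(range(-180, 180, 30))

source_azimuth = 40.0  # source fixed in the room, 40 degrees to the right
for head_yaw in (0.0, 20.0, 40.0, 90.0):  # yaw as reported by a head tracker
    relative = relative_azimuth(source_azimuth, head_yaw)
    chosen = nearest_available_azimuth(relative, AVAILABLE_AZIMUTHS)
    print(f"head yaw {head_yaw:5.1f} -> source at {relative:6.1f}, use HRTF for {chosen:4d}")
```

Snapping to the nearest available direction is the crudest possible choice; interpolating between neighboring directions, and handling head translations and posture, require more care.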
Fig. 2: Our ears have extremely different shapes, resulting in very different sound scattering characteristics, and consequently, very individual HRTFs.
We believe that
this research will lead to effective methods for measuring individual
HRTFs and modifying them dynamically, thereby providing both the
static and the dynamic cues that will produce true 3-D virtual
audio. Such an
accomplishment will be a major advance in the use of information
technology in virtual audio.
An overview of our long-term project, as well as preliminary results that compare the numerical techniques to be used with analytical and experimental results for scattering from simple shapes, will be presented at the conference.