Image to Sound Mapping [Part 1]



Principles of image to sound conversion
In order to achieve a high resolution, each image is transformed into a time-multiplexed auditory representation.


[Figure: principles of the image-to-sound conversion and frequency distributions]

The above figure illustrates the principles of the conversion procedure for the simple example of an 8 × 8, 3 grey-tone image (instead of the 64 × 64, 16 grey-tone mapping that we use in reality). The mapping translates, for each pixel, vertical position into frequency, horizontal position into time-after-click, and brightness into oscillation amplitude. It may take one or two seconds to convert the whole pixel matrix into sound. For a given column, every pixel in this column is used to excite an associated sinusoidal oscillator in the audible frequency range. The set of frequencies can in principle be chosen arbitrarily, but two well-defined benchmark sets are formed by the so-called linear (equidistant) and exponential (well-tempered) distribution, respectively.
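As an illustration of this column-by-column mapping, the following minimal Python sketch builds the samples for a single image column: every pixel in the column drives a sinusoidal oscillator whose frequency depends on the row and whose amplitude depends on the brightness. It is not part of The vOICe software, and the sample rate, conversion time and frequency range are merely assumed example values.

# Minimal sketch of the column-wise image-to-sound mapping (illustrative only):
# row -> frequency, column -> time-after-click, brightness -> amplitude.
import math

N = 8                      # N x N pixel image (8 x 8 as in the example above)
fs = 20000                 # sample rate in Hz (assumed)
T = 1.05                   # conversion time per image in seconds (assumed)
fl, fh = 500.0, 5000.0     # lowest and highest pixel frequencies (assumed)
freqs = [fl * (fh / fl) ** (i / (N - 1)) for i in range(N)]   # exponential set

def column_samples(column):
    """column[0..N-1] holds pixel brightnesses (0..1), bottom row first."""
    ns = int(fs * T / N)   # samples per image column
    out = []
    for k in range(ns):
        t = k / fs
        s = sum(b * math.sin(2 * math.pi * f * t) for b, f in zip(column, freqs))
        out.append(s / N)  # normalize the superposition of N oscillators
    return out

# Example: a single bright pixel halfway up the column yields a single tone.
samples = column_samples([1.0 if i == N // 2 else 0.0 for i in range(N)])

Concatenating the per-column sample blocks from left to right then yields the complete soundscape for the image.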

Many other types of frequency distribution could be used for an auditory display. For example, there is the mel-scale, which stems from perceptual research on subjectively equidistant frequency steps, and the Bark-scale, which originates in critical band measurements. These additional psychophysical scaling functions are available as options in The vOICe for Windows. An advantage of perception-based distributions is that they can best exploit the mix of strengths and weaknesses in auditory perception and segregation. A disadvantage shared by perception-based distributions and psychophysical scaling functions is that they easily lead to a proliferation of alternative expressions and parameter sets, each dedicated to a particular measurement procedure that more or less highlights particular perceptual limitations under a limited set of conditions. In addition, there is the risk of optimizing sonification mappings for existing psychological and cultural habits (with demographic variations) rather than for any fundamental limits that would only appear after prolonged exposure and training (learning and/or unlearning). Furthermore, there is the role of informational masking in becoming aware of relevant details in complex soundscapes, and this factor is likely to depend strongly on user experience. Following the intuitive mapping preferences of listeners is therefore not necessarily the right choice for obtaining the best performance results in the long run.

The following template allows you to obtain lists of frequencies and frequency steps for several types of frequency distribution, and you can override several of the default parameter settings:

Frequency Distributions

Parameters: the lowest frequency fl (in Hz), the highest frequency fh (in Hz), and the number of frequencies N (the size N of an N × N pixel image).
Exponential distribution

using fi = fl · (fh / fl)^((i-1) / (N-1)) for i = 1, ..., N

Linear distribution

using fi = fl + (fh - fl) · (i-1) / (N-1) for i = 1, ..., N

From a Fourier-based frequency-time uncertainty analysis, it is estimated that with the linear distribution the image-to-sound conversion time T should be at least about T = 1.430 · (N-1) · N / (fh - fl) seconds. (A provision is that T / N is larger than 10 / fl seconds, to ensure that the number of sine periods per column is much larger than one.) For shorter conversion times the actual image resolution within the soundscape will drop below the specified N × N resolution due to the frequency spreading in the side lobes of neighbouring pixels in a column.
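As a sketch (with assumed example values for fl, fh and N, not defaults of The vOICe), the two benchmark frequency tables and the resulting lower bound on T can be computed as follows:

# Frequency tables for the linear and exponential (well-tempered) distributions,
# plus the estimated minimum conversion time T for the linear case.
fl, fh, N = 500.0, 5000.0, 64               # example values: Hz, Hz, N x N image

linear = [fl + (fh - fl) * i / (N - 1) for i in range(N)]
exponential = [fl * (fh / fl) ** (i / (N - 1)) for i in range(N)]

T_min = 1.430 * (N - 1) * N / (fh - fl)     # uncertainty-based lower bound (s)
T_min = max(T_min, N * 10.0 / fl)           # provision: >~ 10 sine periods per column
print(f"conversion time T should be at least about {T_min:.2f} s")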
 
In auditory research, the mel-scale is sometimes used. The mel-scale is a frequency distribution that is meant to be perceptually equidistant, and at the 1000 Hz reference frequency the pitch should equal 1000 mels. There is no full consensus yet about the exact definition of the pitch value in mels as a function of frequency, but approximate analytical model expressions can be found in the literature. Our choice is that for equidistant points mi in mels we derive a corresponding set of frequencies fi such that the particular model expression mi = 1000 · ln(1 + fi / b) / ln(1 + 1000 / b) is fulfilled, where fi is again the frequency in Hz, mi is the resulting mel-scaled frequency, and b is a model parameter. It turns out that a good perceptual match is obtained for b = 700, and the above expression can then be rewritten as mi ≈ 2595 · log10(1 + fi / 700). Applying these scalings yields a frequency table with N frequencies in the range from fl to fh. For low frequencies, the mel-scale is almost linear, while for medium and high frequencies it becomes exponential in nature.
Mel distribution

using mi = ml + (mh - ml) · (i-1) / (N-1) for i = 1, ..., N
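A mel-based frequency table can be generated by taking equidistant points in mels and mapping them back to Hz through the inverse of the model expression given above; the following sketch uses b = 700 and assumed example values for fl, fh and N:

# Mel-scale frequency table: equidistant points in mels mapped back to Hz by
# inverting  m = 1000 * ln(1 + f/b) / ln(1 + 1000/b)  with b = 700.
import math

fl, fh, N, b = 500.0, 5000.0, 64, 700.0     # example values

def hz_to_mel(f):
    return 1000.0 * math.log(1.0 + f / b) / math.log(1.0 + 1000.0 / b)

def mel_to_hz(m):
    return b * ((1.0 + 1000.0 / b) ** (m / 1000.0) - 1.0)

ml, mh = hz_to_mel(fl), hz_to_mel(fh)
mel_freqs = [mel_to_hz(ml + (mh - ml) * i / (N - 1)) for i in range(N)]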
 
Yet another perception-based frequency scale is the Bark-scale, used to express the critical band rate, which may be interpreted as a measure for the tonotopic mapping in the cochlea. Although originally published as a table of values (Zwicker et al., and often used in simplified cochlear models involving a discrete filter bank with some 24 filters), convenient invertible and accurate analytical function approximations have been derived by Hartmut Traunmüller (J. Acoust. Soc. Am., Vol. 88, 1990, pp. 97-100). The critical band rate zi is given by zi = 26.81 / (1 + 1960 / fi) - 0.53 for frequencies fi in Hz. The error is less than 0.05 Bark for the frequency interval [0.2, 6.7] kHz. Using equidistant but generally non-integer points zi in Bark, we again obtain a table with N frequencies in the range from fl to fh. Here we will skip the intricacies of the tonotopic mapping versus temporal processing, which can further affect the choice of the most appropriate pitch scale depending on the soundscape characteristics.
Bark distribution

using zi = zl + (zh - zl) · (i-1) / (N-1) for i = 1, ..., N
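Analogously, a Bark-based table follows from equidistant points in Bark mapped back to Hz by inverting Traunmüller's expression; again a sketch with assumed example values:

# Bark-scale frequency table: equidistant points in Bark mapped back to Hz by
# inverting Traunmueller's  z = 26.81 / (1 + 1960/f) - 0.53.
fl, fh, N = 500.0, 5000.0, 64               # example values

def hz_to_bark(f):
    return 26.81 / (1.0 + 1960.0 / f) - 0.53

def bark_to_hz(z):
    return 1960.0 / (26.81 / (z + 0.53) - 1.0)

zl, zh = hz_to_bark(fl), hz_to_bark(fh)
bark_freqs = [bark_to_hz(zl + (zh - zl) * i / (N - 1)) for i in range(N)]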


Artificial scenes with FREE SOFTWARE

The following figure illustrates the principles of the mapping for an artificial scene containing a bright diagonal line, three very short horizontal lines and a (partially) bright filled rectangle on a dark background. The spectra for two columns (labeled red and green) are shown on the right. A corresponding sound for this figure can be heard when clicking on the figure, which links to a sound stored in Windows wave (.wav) format.

[Figure: artificial scene with bright diagonal line, short horizontal lines and filled rectangle, with the spectra of two marked columns shown on the right]

For this particular auditory scene, the sound was synthesized off-line by a computer program (ANSI C source code arti1.c, or its port to Python 3 arti1.py). The unwanted ``speckles'' in the sound are the consequence of the simple rectangular time window employed in turning pixel sounds on or off. Here follows an overview of several computer-generated soundscapes with associated (completely self-contained) program sources:

Artificial 64 × 64 pixel / 16 grey-tone scenes
21K soundscapes (20 kHz / 8 bit / 1.05 s)

Lines & rectangles: arti1.wav (source arti1.c)
B&W drawing: arti2.wav (source arti2.c)
Math curve plot: arti3.wav (source arti3.c)
PC screen layout, icons & windows: arti_gui.wav (source arti_gui.c)

More scenes

Speech synthesis: artitalk.wav (source artitalk.c)
Letters A, B & C: arti_abc.wav (source arti_abc.c)
Parked car: arti_car.wav (source arti_car.c)
Human face: artiface.wav (source artiface.c)

... and for those who wish to further explore the possibilities of software-based hifi stereo soundscape generation, the source of a more elaborate computer program, hificode.c, is available. This program was written to provide you with additional options for obtaining higher quality soundscapes (in this example hificode.wav). The program can generate CD quality 44.1 kHz 16-bit stereo samples, where the stereo effect now incorporates both interaural time delays and head-masking. Furthermore, the above-mentioned speckles are avoided by using smooth quadratic (variation diminishing, or QVD) B-spline time windows which completely remove the column switching discontinuities. Finally, a variation of the hificode.c program for live camera processing with OpenCV on Windows or Linux is available as hificode_OpenCV.cpp (or its direct port to Python 3 hificode_OpenCV.py, which in a Python interpreter tends to run too slowly for practical use). A zipped Microsoft Visual Studio 2010 project for hificode_OpenCV.cpp is vOICeOpenCV.zip, in which you will likely need to adapt the following Project properties to account for your version of OpenCV (here 2.4.11) and the path where you installed it: C/C++ | Additional Include Directories, Linker | Additional Library Directories, Linker | Input | Additional Dependencies.
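To illustrate the idea behind such smooth time windows, the following sketch defines a uniform quadratic B-spline weight for cross-fading adjacent columns; it is only one possible realization, written for this page, and not a transcription of the windowing actually used in hificode.c.

# Smooth column window sketch: a uniform quadratic B-spline basis function.
# Integer-shifted copies sum to 1, so adjacent columns cross-fade without the
# abrupt on/off switching (and resulting "speckles") of a rectangular window.
def bspline2(t):
    """Uniform quadratic B-spline on [0, 3)."""
    if 0.0 <= t < 1.0:
        return 0.5 * t * t
    if 1.0 <= t < 2.0:
        return 0.5 * (-2.0 * t * t + 6.0 * t - 3.0)
    if 2.0 <= t < 3.0:
        return 0.5 * (3.0 - t) * (3.0 - t)
    return 0.0

def column_weight(j, t):
    """Weight of column j at time t, with t measured in column durations."""
    return bspline2(t - j + 1.5)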

Please note that the software in this section has not been optimized for speed, code quality or any other purpose; it merely provides a crude but functional reference for use in implementation and testing on various platforms. You may use the software (source code) that is directly linked from this section on this (and only this) web page under the Creative Commons Attribution 4.0 International License (CC BY 4.0): that is, you may freely share and adapt the software for any purpose, including commercial usage, provided that in all places where you describe or use the functionality you give credit to the author (Copyright © Peter B.L. Meijer) for providing the original version of the software, and include a link back to this web page (http://www.seeingwithsound.com/im2sound.htm) and/or website (http://www.seeingwithsound.com).

Note: instead of implementing The vOICe image-to-sound mapping algorithms in Python, you can also simply launch The vOICe web app from within Python using just a few lines of code:
# Launch The vOICe web app from python; needs online access only once
import os, platform
appURL = 'https://seeingwithsound.com/webvoice/webvoice.htm'
browser = 'firefox'  # or 'chrome'
osStart = 'start ' if platform.system() == 'Windows' else ''
os.system(osStart + browser + ' ' + appURL)
Thanks to the persistent caching of progressive web apps, you need to be online only once when running this code, and in subsequent runs it will work offline as well.

 

Relations with other work

It is worth noting that soundscapes like those given above could be used to convey curve shapes in (multiple) mathematical function plots, teach the congenitally blind about visual perspective, and fulfill a number of other educational purposes. In fact, the mapping of The vOICe even encompasses the Sound Graph method as described in Douglass L. Mansur, Meera M. Blattner and Kenneth I. Joy, ``Sound Graphs, A Numerical Data Analysis Method for the Blind,'' Journal of Medical Systems, Vol. 9, pp. 163-174, 1985. In that paper it was stated that ``mathematical concepts such as symmetry, monotonicity, and the slopes of lines could be determined quickly using sound,'' and advantages over tactile mappings were observed. For a single two-dimensional line graph, the mapping of The vOICe simply reduces to the above Sound Graph mapping (except for some minor differences w.r.t. conversion time, pitch range and discretization). Another mapping encompassed by The vOICe is the resynthesis of speech as performed by the Pattern Playback, an early talking device developed in the 1940s by Franklin S. Cooper and others. That device used a rotating disk with holes to frequency-modulate the light shining through or reflected from slices of a moving spectrogram, and the modulated light was then converted into an electrical signal driving a loudspeaker. The technology behind the Pattern Playback is similar to that of the Nipkow disk as patented in 1884 by Paul Gottlieb Nipkow. Technically, one could say that the Pattern Playback relates to The vOICe as the Nipkow disk relates to the television of today, although it should be stressed that The vOICe was not developed for speech synthesis purposes.

In graphical user interfaces (GUIs), icons displayed on a computer screen would become simple earcons when using bright icons on a dark background. For specific applications like these, there may always exist more appropriate but less general image-to-sound mappings. For instance, one could attach fixed or parameterized earcons to the small collection of object types that make up a GUI - widgets of several categories like the button, the checkbox, the menu, the scrollbar, the dialog, and, of course, the window. However, the abstract mapping of The vOICe is far more general by covering arbitrary images. This is now also demonstrated by The vOICe Java applet and by The vOICe for Windows.

Links to related online material available at other sites can be found on the external links page, and projects in which image-to-sound mappings similar to or related to The vOICe mapping are employed can be found on the related projects page.

Notes on Auditory Icons, Earcons, Voicels & Wavelets
Apart from sonifying natural images and figurative drawings, one could also use or superimpose any computer-generated artificial images, or spectrographic images derived from natural sounds, as input for The vOICe mapping. Thereby, one could indeed cover a very wide range of sound types, ranging from human speech or fixed auditory icons up to parameterized earcons involving melodic or rhythmic motives. The drawback would be that much additional effort is needed to construct such intermediate mappings, and it is feasible only for restricted environments. This has to be weighed against potential advantages in improved perception and ease of learning. For instance, dragging and dropping of icons can be accompanied by natural sonic metaphors like dragging and dropping sounds, but if the collection of different objects and actions becomes large, it becomes increasingly difficult to find or invent corresponding intuitive and informative sounds that still require little or no learning effort. A single completely general mapping like the mapping of The vOICe may then become more attractive, even if it is hard to learn. Whereas auditory icons and parameterized earcons are probably an excellent choice for a sonic or multimodal sensory representation of the basic functionality of an object-based GUI, their scope does not extend to natural images from the environment, nor to arbitrary bitmaps, unless one views, by definition, the pixel-level representation offered by The vOICe as being the superposition of elementary earcons, tentatively called ``voicels'': each pixel, having a brightness and a vertical and horizontal position, is represented by a parameterized short beep, a voicel, having a corresponding loudness, pitch and time-after-click. Voicels are closely related to wavelets w.r.t. their localization in time and frequency. However, the voicel basis functions need only be approximately orthogonal, thereby giving greater freedom in defining alternative voicel types within the relevant auditory/perceptual limits. For instance, simple truncated sine waves - as presently used by The vOICe - are not wavelets, but they nevertheless yield a mapping that is virtually bijective (hence invertible) under practical conditions. This can be demonstrated by spectrographic reconstructions. A further discussion on wavelets can be found at the auditory wavelets page.

Continued at Image to Sound Mapping [Part 2] »

Copyright © 1996 - 2024 Peter B.L. Meijer