A prototype system has been designed and built for M = N = 64 and G = 16. The use of 16 grey-tones ensures sufficient information preservation in shading, while causing smooth changes in brightness to be perceived as smooth changes in loudness. The latter is important to avoid hearing distracting artificial discretization boundaries within, for example, near-uniform image backgrounds. The system design involved a 64 × 64 pixel matrix instead of a 50 × 50 matrix, because powers of 2 are more convenient for a digital implementation. In order to be able to improve the quality of the sound representations in less time-critical situations (e.g., static images), the design also included a user-switchable 1.05 s or 2.10 s conversion time T. According to (6), we find a system throughput of up to 16 kb/s for T = 1 s, and 8 kb/s for T = 2 s.
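The throughput figures can be checked with a short calculation. Equation (6) is not reproduced in this section, so the sketch below assumes it has the form of pixel count times bits per pixel divided by the conversion time; the function name `throughput` is illustrative only.

```python
import math


def throughput(m, n, g, t):
    """Bits per second, ASSUMING (6) has the form M * N * log2(G) / T."""
    return m * n * math.log2(g) / t


# Prototype parameters from the text: M = N = 64, G = 16.
rate_1s = throughput(64, 64, 16, 1.0)  # 16384 b/s, i.e. about 16 kb/s
rate_2s = throughput(64, 64, 16, 2.0)  # 8192 b/s, i.e. about 8 kb/s
```

Under this assumption the rounded values of 16 kb/s and 8 kb/s quoted above are reproduced.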
The M = 64 sinusoidal oscillators have not been implemented as physically distinguishable analog components, as was the basis of [6]. Instead, the behaviour of the oscillators is emulated via fast digital computations. In this way, the implementation can be made much more compact (portable), more robust (against detuning) and more flexible (programmable), and also cheaper and lower in power dissipation (for battery operation), than a large set of independent and precisely-tuned analog oscillators could be. In [6], the tuning problem was treated by using digital frequency dividers, but the other aspects remained unsolved due to the 64-fold hardware involved in signal integration, modulation, and summation.
Fig. 2. Principles of the sound sample synthesis.
Fig. 2 depicts schematically
how a sound sample is calculated for M oscillators, by
updating the phase of oscillator i with a phase step,
calculating the sine of the resulting phase,
scaling it by the brightness of the
pixel at vertical position (row) i, and adding the result
to a superposition accumulator. The phase, phase steps,
scaled sine values and pixel brightnesses all reside in memory
modules. Thereby expensive hardware multipliers or sine
evaluators could be avoided, while increasing the flexibility
for the implementation, via programming, of alternative mappings.
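The sample calculation described above can be sketched in software as follows. This is a minimal illustration and not the hardware design: the table layout (one row of scaled sine values per grey level), the linear brightness scaling, and the amplitude constant `AMPLITUDE` are assumptions, as are all names. The 16-b phase storage and the 12-b sine addressing follow the bus widths given later in the text.

```python
import math

M = 64            # oscillators, one per image row
PHASE_BITS = 16   # width of the stored phase values
TABLE_BITS = 12   # phase bits used to address the sine table
G = 16            # grey levels

# ASSUMED amplitude per grey level, chosen so that 64 oscillators at
# maximum brightness (64 * 15 * 34 = 32640) still fit a signed 16-b sample.
AMPLITUDE = 34

# Hypothetical scaled-sine memory: SCALED_SINE[g][k] holds brightness g times
# the sine at table position k, so no multiplier is needed at run time.
SCALED_SINE = [
    [round(g * AMPLITUDE * math.sin(2 * math.pi * k / 2**TABLE_BITS))
     for k in range(2**TABLE_BITS)]
    for g in range(G)
]


def next_sample(phases, phase_steps, column):
    """Advance all oscillator phases in place and return one summed sample."""
    acc = 0  # superposition accumulator
    for i in range(M):
        # Update the 16-b phase of oscillator i with its phase step.
        phases[i] = (phases[i] + phase_steps[i]) % 2**PHASE_BITS
        # Truncate to 12 b to address the table, select the row by brightness.
        index = phases[i] >> (PHASE_BITS - TABLE_BITS)
        acc += SCALED_SINE[column[i]][index]
    return acc
```

Here `column` holds the 64 grey levels (0 to 15) of the image column currently being sounded, and `phase_steps` plays the role of the programmable frequency distribution.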
The frequencies of all 64 individual oscillators, determining
the applied bandwidth, are programmable. This provides sufficient
flexibility for further optimization of the system. It may also
be used to take care of individual differences - and malfunctions -
in hearing among users, by using a personalized frequency distribution.
In the present prototype, a 16K-word
memory module
allows for 256 different and independent frequency distributions for
the 64 oscillators, without reprogramming.
To reduce system cost further,
only commercially available components of standard speed
were used. Apart from CMOS memory components, most
of the system logic was implemented using standard LS TTL
technology. Using a system clock of F = 2 MHz, a serial computation
of samples of the superposition of oscillator signals can
take place at a frequency of F/M = 31.25 kHz, which is
sufficiently high for very good audio quality (cf. CD players
using 44.1 kHz). The superposition samples are represented by
16-b values, again to achieve high audio quality. The
proper processing of 2 million pixel oscillator samples
per second at a 2 MHz system clock requires significant
parallelism. Because the transformation itself is fixed,
a pipelined design was made to provide the required data
flow. Several oscillator contributions are being processed
simultaneously, but in physically different stages of the necessary
set of operations. For example, the phase step value of oscillator
i+1 is read from memory at the same time as the scaled sine
value belonging to oscillator i (of which the phase step value has
been read 500 ns earlier).
The resulting special purpose computer is optimized
towards the image-to-sound conversion. The whole
conversion system, including 20 ms frame grabbing and
16 grey-tone digitization hardware for input, and the analog
output stages for the headphones, has been implemented on a
single 236 × 160 mm circuit board, dissipating
a measured 4.4 W. In addition, a commercial 2.7 W Philips
12TX3412 vidicon observation camera is presently used for input.
This standard 625-line PAL camera delivers 312- and 313-line interlaced
images every 20 ms, of which only 64 lines are used for
later conversion, by neglecting three out of every four lines in
a centered subset of 256 lines.
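The line selection can be illustrated with a few lines of code. The sketch assumes a 312-line field, zero-based line numbering, and that the first line of every group of four is the one retained; the text does not specify these details, and the function name is hypothetical.

```python
def selected_lines(field_lines=312, subset=256, keep_every=4):
    """Return the field line numbers that would be kept for conversion."""
    # Center the 256-line subset within the field (ASSUMED zero-based).
    start = (field_lines - subset) // 2
    # Keep one line out of every four, neglecting the other three.
    return list(range(start, start + subset, keep_every))


lines = selected_lines()
```

With these assumptions the result contains exactly 64 line numbers, one per image row.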
The digitization hardware applies Gray code encoding
in a 4-b flash AD-converter, and includes a
sample-and-hold circuit for extremely rapid video signal transitions.
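For readers unfamiliar with Gray code, the following sketch shows the standard reflected binary Gray code for 4-b values. The text does not state which Gray code variant the converter uses, so the reflected binary form is an assumption, and the function names are illustrative.

```python
def to_gray(value):
    """Convert a binary number to reflected binary Gray code."""
    return value ^ (value >> 1)


def from_gray(code):
    """Convert a reflected binary Gray code back to a binary number."""
    value = 0
    while code:
        value ^= code
        code >>= 1
    return value


# All 16 grey levels of a 4-b converter, in Gray code.
gray_levels = [to_gray(level) for level in range(16)]
```

The defining property is that the codes of neighbouring grey levels differ in exactly one bit, so a sample taken while the signal crosses a level boundary can be off by at most one level.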
The sampling, digitization and storage of a 64 × 64 pixel
image takes place within a single 20 ms frame time, thereby avoiding
blurred images. Because the camera frame generation runs
independently, vertical and horizontal (delayed) synchronization
signals are used to synchronize the conversion system with
the camera. Due to the required synchronization, the conversion
system may first have to wait for a period lasting less than 20 ms,
before the 20 ms frame grabbing of a new camera image
can actually start. Still, the worst-case total of 20 ms + 20 ms = 40 ms
remains small compared with the conversion time T. The user only notices a characteristic
synchronization click, as needed to delimit subsequent sound patterns.
Fig. 3. The pipelined system architecture.
Fig. 3 illustrates the system architecture with a more
detailed overview of the system design,
close to the component level. The analog stages for input
(camera IN) and output (headphones OUT) are only indicated
symbolically. The drawn width of data and
address buses is proportional to the number of bits.
The two oversized 2K × 8 SRAMs for the 64 16-b
values were used only because discrete 64 × 16 b memories
are non-standard, and consequently far more expensive than
these larger memory chips.
Using divisions of the system clock by $2^{21}$ or $2^{22}$,
T can at any moment be switched by the user between a 1.05 s
and a 2.10 s image-to-sound conversion time.
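The two conversion times follow directly from dividing the 2 MHz system clock by 2^21 or 2^22, as the following sketch shows. It additionally assumes that the N = 64 image columns each occupy an equal share of T, which gives the number of sound samples per column; the function names are illustrative.

```python
F = 2_000_000  # system clock in Hz
M = 64         # oscillators; one sound sample takes M clock cycles
N = 64         # image columns


def conversion_time(k):
    """Conversion time T in seconds for a clock division by 2**k."""
    return 2**k / F


def samples_per_column(k):
    """Sound samples per image column, ASSUMING equal column durations."""
    return 2**k // (M * N)


t_short = conversion_time(21)  # 1.048576 s, quoted as 1.05 s
t_long = conversion_time(22)   # 2.097152 s, quoted as 2.10 s
```

Under the equal-duration assumption, each column is represented by 512 samples at the short setting and 1024 samples at the long setting.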
The proper width of the data and address buses has to be
selected very carefully. For example, truncation of digital
phase values to a smaller number of bits leads to a kind of
frequency noise. The bandwidth of this noise should be much less
than the frequency differences between neighbouring oscillator
frequencies. In the present design this condition is met by
a frequency noise less than
$F / 2^{18} \approx 7.6$ Hz, due to a division of F by
M = 64 = $2^6$ to arrive at the sound sample frequency, and an additional
division by $2^{12}$ corresponding to the truncated 12-b phase values
used for calculating sine values within one whole sine period.
The truncation to 12 b can lead to at most a $1/2^{12}$ period
phase step error when going from one sound sample to the next.
The different oscillator frequencies are freely programmable integer
multiples of
$F / 2^{22} \approx 0.48$ Hz, because the
phase values are updated and stored as 16-b values before their
truncation to 12 b. This ensures the assumed orthogonality of
the oscillators, and provides a very fine scale for selecting
and programming a particular set of frequency distributions.
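The relation between the bus widths and the frequency figures can be worked through numerically. The sketch below assumes that a 16-b phase step of 1 advances the phase by 1/2^16 of a period per sound sample, which is the usual phase-accumulator convention; the helper names are hypothetical.

```python
F = 2_000_000          # system clock in Hz
M = 64                 # oscillators per sound sample
SAMPLE_RATE = F / M    # 31250 Hz sound sample frequency

# Frequency resolution from the 16-b stored phase: F / 2**22.
RESOLUTION = SAMPLE_RATE / 2**16

# Frequency noise bound from truncation to 12 b: F / 2**18.
NOISE_BOUND = SAMPLE_RATE / 2**12


def oscillator_frequency(phase_step):
    """Frequency in Hz produced by an integer 16-b phase step."""
    return phase_step * RESOLUTION


def phase_step_for(frequency):
    """Nearest integer phase step for a desired frequency in Hz."""
    return round(frequency / RESOLUTION)
```

For example, `phase_step_for(500.0)` gives a step of 1049, which realizes a frequency within one resolution step of 500 Hz. Note that the resolution F / 2^22 equals 1/T for the 2.10 s setting, consistent with the orthogonality remark above.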
For the purpose of experimentation, an observation monitor is presently an integral part of the experimental setup, to provide visual feedback to the sighted experimenter or teacher on the field of view and on the quality of the camera image, concerning contrast and definition.