4
Hearing Speech
How our auditory
system supports our ability to understand speech is without a doubt one of the
most important aspects of hearing research. Severe hearing loss often brings social
isolation, particularly in those who were not born into deaf communities. For
most of us, the spoken word remains the primary communication channel,
particularly for the intimate, face-to-face exchanges and conversations that we
value so much. Naturally, we hope that, in due course, a better understanding
of the neurobiology of speech processing will improve our ability to repair
problems if this system goes wrong, or may allow us to build artificial speech
recognition systems that actually work, computers that we could actually talk
to. But while the potential rewards of research into the auditory processing of
speech are great, considerable technical and conceptual challenges also slow progress
in this area.
One major challenge stems from the
great complexity of speech. Essentially, a relatively modest number of speech
sounds (sometimes called “phones”) are recombined according to rules of
morphology to form words, and the words are recombined according to the rules
of grammar to form sentences. Our auditory brain can analyze the sound of any
sentence in a language we have learned, and decipher its meaning, yet the
number of correct, meaningful sentences in any language is so large as to be,
for all practical intents and purposes, infinite. Thus, when we learn a
language, we must not only acquire a sizeable lexicon of sound to meaning
mappings, but also perfect our grasp of the rules of morphology and grammar,
which allow us to manipulate and recombine the speech sounds to generate
endless varieties of new meanings.
Many animal species use
vocalizations to communicate with members of their own species, and sometimes
these communication sounds can be quite elaborate—consider, for example,
certain types of bird song, or the song of humpback whales. Nevertheless, the
complexity of human speech is thought to have no equal in the animal kingdom.
Dogs and some other domestic animals can be trained to understand a variety of
human vocal commands, and rhesus monkeys in the wild are thought to use between
thirty and fifty types of vocalization sounds to convey different meanings. The
number of different vocalizations used in these examples of
inter- and intraspecies vocal communication
is, however, fairly modest compared to the size of a typical human language
vocabulary, which can comprise tens of thousands of words.
Animals also seem to have only a
limited capacity for recombining communication sounds to express new concepts
or describe their relationships. Even the most monosyllabic of human teenagers
readily appreciates that the meaning of the sentence “John eats and then flies” is very different from that of “And then John eats flies,” even though
in both sentences the speech sounds are identical and only their order has
changed a bit. Human children seem to learn such distinctions with little
effort. But while you do not need to explain this difference to your child, you
would have a terribly hard time trying to explain it to your dog. Indeed, many
language researchers believe that humans must have an innate facility for
learning and understanding grammar, which other animals seem to lack (Pinker,
1994).
Given this uniquely high level of
development of speech and language in humans, some may argue that there is
little point in studying the neural processing of speech and speechlike sounds in nonhuman animals. However, the tools
available to study the neural processing of communication sounds in humans are
very limited. Work with humans depends heavily on lesion studies that attempt to correlate damage to particular brain areas with loss of function, on noninvasive functional imaging techniques, or on the
rare opportunities where electrophysiological recording or stimulation
experiments can be incorporated into open brain surgery for the treatment of
epilepsy. Each of these methods has severe limitations, and the level of detail
that is revealed by microelectrode recordings in animal brains is, as we shall
see, still an invaluable source of complementary information.
But using animal experiments to
shed light on the processing of vocalizations is not merely a case of “looking
for the key where the light is.” Human speech almost certainly evolved from the
vocal communication system of our primate ancestors, and it evolved its great
level of complexity in what is, in evolutionary terms, a rather short period of
time. No one is quite certain when humans started to speak properly, but we
diverged from the other surviving great ape species some 5 million years ago,
and our brains reached their current size some 1 to 2 million years ago (no
more than 100,000 generations). During this period, human speech circuits almost
certainly arose as an adaptation and extension of a more or less generic,
rudimentary mammalian vocal communication system, and animal experiments can
teach us a great deal about these fundamental levels of vocalization
processing.
Speech can be studied on many levels,
from the auditory phonetic level, which considers the manner in which
individual speech sounds are produced or received, to the syntactic, which considers
the role of complex grammatical rules in interpreting speech, or the semantic,
which asks how speech sounds are “mapped onto a particular meaning.” As we have
seen, studies of animal brains are unlikely to offer deep parallels or insights
into syntactic processing in the human auditory system, but neural
representations of speech sounds or animal vocalizations on the auditory/phonetic
level are bound to be very similar from one mammal to the next. Most mammals
vocalize, and do so in much the same way as we do (Fitch, 2006). As we briefly
described in section 1.6, mammals vocalize by pushing air through their larynx,
causing their vocal folds to vibrate. The resulting sound is then filtered
through resonant cavities in their vocal tracts, to impart to it a characteristic “formant” structure. (On the book’s web site you can find short video clips showing the human vocal folds and vocal tract in action.) There are some
differences in detail, for example, humans have a relatively deep-sitting
larynx, which makes for a particularly long vocal tract, and may allow us to be
a “particularly articulate mammal” (Ghazanfar & Rendall, 2008), but the basic layout is essentially the
same in all mammals, and the sounds generated by the vocal tracts of different
types of mammals consequently also have much in common. These similarities are more
obvious in some cases than in others (few humans, for example, would be very
flattered to hear that they sound like a donkey), but at times they can be very
striking, as in the case of Hoover the talking harbor seal.
When vocalizing to communicate
with each other, animals also face some of the same challenges we humans have
to overcome to understand speech. For example, both human and animal
vocalizations are subject to a fair amount of individual variability. No two
humans pronounce the same word absolutely identically. Differences in gender,
body size, emotional affect, as well as regional accents can all change the
sound of a spoken word without necessarily changing its meaning. Similarly,
dogs who differ in size or breed can produce rather
different sounding barks, yet despite these differences they all remain
unmistakably dog barks. Even song birds appear to have pronounced “regional dialects”
in their songs (Marler & Tamura, 1962). One common
problem in the processing of human speech as well as in animal vocalizations is
therefore how to correctly identify vocalizations, despite the often very
considerable individual variability. The neural processes involved in
recognizing communication sounds cannot be simple, low-level feature
extractors, but must be sophisticated and flexible pattern classifiers. We
still have only a very limited understanding of how animal and human brains
solve this kind of problem, but in experimental animals, unlike in humans,
these questions are at least in principle amenable to detailed experimental
observation.
This chapter is subdivided into
seven sections. In the first two sections, we examine the acoustic properties
of speech in greater detail, consider how the acoustic
features of speech evolve in time, and how they are categorized into distinct
speech sounds. In the third section, we describe how speech sounds are thought
to be encoded in subcortical structures. Our
knowledge of these subcortical representations comes
exclusively from animal experiments. In the fourth section, we briefly review the
anatomy of the cortex, and in the fifth we summarize what clinical observations
have taught us about the role played by various parts of the human cerebral
cortex in speech processing. In the last two sections, we examine the roles of
primary and higher-order cortical fields in greater detail, in the light of
additional information from human brain imaging and animal experiments.
When one considers
speech as an auditory stimulus, one obvious question to ask is: Is speech
radically different from other sounds, and if so, in what way? In section 1.6,
we looked at the production of vocalization sounds, and we noted that the
physics of vocal sound production is not particularly unusual, and contains
nothing that might not have an equivalent in the inanimate world. (You may wish
to glance through section 1.6 quickly before you read on if it is not fresh in
your mind.) The harmonics in voiced speech sounds produced by the oscillating
vocal folds are not so different from harmonics that might be produced by
vibrating taut strings or reeds. Unvoiced fricatives are caused by turbulent
airflow through constrictions in the vocal tract, and they resemble the noises
caused by rushing wind or water in both the way they are created and the way
they sound. Resonant cavities in our vocal tract create the all-important formants
by enhancing some frequencies and attenuating others, but they operate just
like any other partly enclosed, air-filled resonance chamber. So if speech
sounds are, in many respects, fundamentally similar to other environmental sounds
and noises, we might also expect them to be encoded and processed in just the
same way as any other sound would be by neurons of the auditory system.
Nevertheless, listeners only
rarely mistake other environmental sounds for speech, so perhaps there is
something about speech that makes it characteristically speechlike,
even if it is not immediately obvious what this something is. Consider the
sound of wind rushing through some trees. It may contain noisy hissing sounds that
resemble fricative consonants (fffff-, sss-, shhh-like sounds), and it
may also contain more harmonic, vaguely vowel-like “howling.” But the pitch and
amplitude contours of these sounds of howling wind usually change only slowly,
much more slowly than they would in speech. Meanwhile, a small stream of water
dropping into a pond might trickle, gurgle, and splash with an irregular rhythm
rather faster than that of speech. Thus, it seems that speech has its own
characteristic rhythm. Speech sounds change constantly in a manner that is
fast, but not too fast, and somewhat unpredictable yet not entirely irregular.
But can we turn this intuition regarding possible characteristic rhythms of
speech into something more tangible, more quantifiable?
One way to approach this question
is to consider the mechanisms of speech production in a little more detail. A
good example of this type of analysis can be found in a paper by Steven
Greenberg (2006), in which he argues that the syllable may be the most appropriate
unit of analysis for speech sounds. Based on a statistical analysis of a corpus
of spoken American English, he concluded that syllables consist of an optional
“onset” (containing between zero and three consonants), an obligatory “nucleus”
(a vowel sound, which can be either a monophthong
like the /a/ in “at,” or a diphthong, like the /ay/ in “may”), and an optional
“coda” (containing between zero and four consonants). A single English syllable
can therefore be as simple as “a” or as elaborate as “straights.” Greenberg
would argue that more “atomic” speech sound units, such as the phoneme, or
phone, are unreal in the sense that they have no independent existence outside
the syllabic framework. Furthermore, he points out that the information content
of consonants depends on the syllabic context. For example, onsets are more
informative than codas, as can be seen by the fact that consonants in the coda
can often be lost without any loss of intelligibility (consider the lost /d/ in
“apples an’ bananas”). Given the diversity of English syllables, it is
unsurprising that they can also vary considerably in their temporal extent.
English syllables are typically 100 to 500 ms long, and are characterized by an
“energy arc,” since the vowel nucleus is normally up to 40 dB more intense than
the consonants of the onset or the coda. Note that not all languages exhibit as
much phonetic diversity in their syllables as English. In spoken Japanese, for
example, onsets very rarely comprise more than a single consonant, and the only
commonly used coda to a syllable is an optional “n.” Consequently, in Japanese
there are only a few hundred possible syllables, while in English there are
many thousands, but the onset-nucleus-coda syllabic structure is clearly a
feature of both languages.
As mentioned in chapter 1,
engineers like to refer to changes in a signal over time as “modulations,” and
they distinguish two fundamental types: amplitude modulation (AM, meaning the
sound gets louder or quieter) and frequency modulation (FM, meaning the
frequency content of the sound changes). Greenberg’s observation of one energy
arc in every spoken syllable, and one syllable every few hundred milliseconds or so, would lead us to expect that speech sounds should exhibit marked AM at
modulation rates of a few Hertz. Is this expectation borne out? If you look at
spectrograms of spoken sentences, like those shown in figure 2.13A or figure
4.1A, you do, of course, notice that speech contains
plenty of both AM and FM. But it is not obvious, just from looking at the
spectrograms, what the properties of these modulations really are, or whether
speech exhibits characteristic modulations, which would be either particularly
prominent in spoken sentences or particularly important in carrying the
information encoded in the speech signal.
How best to identify and describe
the modulations that are characteristic of speech is an old and important
research question. One recent study by Elliott and Theunissen
(2009) sheds new light on this by analyzing and manipulating the modulation
spectra of speech. (At first glance, the concept of a modulation spectrum is
perhaps a little technical and abstract, but it is useful, so hang in there for
the next two pages or so.) Essentially, modulation spectra are the two-dimensional
Fourier transforms (2DFTs) of the signal’s spectrogram. This may appear
terrifyingly complicated to the uninitiated, but it is not quite as bad as all
that. The concept is illustrated in figure 4.1. Figure
4.1A shows the spectrogram of a spoken sentence. Recall
from section 1.3 that ordinary Fourier transforms express a one-dimensional
signal, like a sound, as a superposition (or sum) of a large number of suitably
chosen sine waves. The 2DFT does a very similar thing, by expressing a two-dimensional
“picture” as a superposition of sine wave gratings or “ripples.” Think of these
ripples as regular zebra stripes of periodically (sinusoidally)
increasing and decreasing amplitude.
Figure 4.1B illustrates this by showing how
regions within the spectrogram are well approximated by such ripples. Consider
the region delineated by the leftmost black elliptic contour in the spectrogram.
This patch contains a very obvious harmonic stack associated with one of the
vowel sounds in the sentence, and the regularly spaced harmonics can be well
approximated by a ripple with a matching spectral modulation rate, that is, stripes
of an appropriate spacing along the vertical, frequency, or spectral dimension.
In this conceptual framework, spectral modulations are therefore manifest as zebra
stripes that run horizontally across the spectrogram, parallel to the time axis, and a set of harmonics with a fundamental frequency of
200 Hz would thus be captured to a large extent by a spectral modulation with a
modulation rate of 5 cycles/kHz. Perhaps counterintuitively,
a lower-pitched sound, with harmonics spaced, say, every 100 Hz, would
correspond to a higher spectral modulation rate of 10 cycles/kHz, since low-pitched
sounds with lower fundamental frequencies can squeeze a larger number of
harmonics into the same frequency band.
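For readers who like to see the computation spelled out, the following short Python sketch (our own illustration, not code from Elliott and Theunissen, 2009) computes a crude modulation spectrum by taking the two-dimensional Fourier transform of a log-magnitude spectrogram. All function and parameter names are ours, and the details (window length, scaling) are simplified choices rather than the published method.

```python
import numpy as np

def modulation_spectrum(wave, fs, win_ms=25.0, hop_ms=5.0):
    """Rough modulation spectrum: 2D FFT of a log-magnitude spectrogram.
    A sketch for illustration only; real analyses use carefully chosen
    spectrographic filters."""
    win = int(fs * win_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    frames = [wave[i:i + win] * np.hanning(win)
              for i in range(0, len(wave) - win, hop)]
    # Spectrogram: frequency along rows, time along columns
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1)).T
    log_spec = np.log(spec + 1e-6)
    # 2D Fourier transform of the spectrogram = modulation spectrum
    mod_spec = np.abs(np.fft.fftshift(np.fft.fft2(log_spec)))
    # Modulation axes: temporal (Hz) and spectral (cycles/kHz)
    t_mod = np.fft.fftshift(np.fft.fftfreq(log_spec.shape[1], d=hop / fs))
    f_step_khz = (fs / 2) / (log_spec.shape[0] - 1) / 1000.0
    s_mod = np.fft.fftshift(np.fft.fftfreq(log_spec.shape[0], d=f_step_khz))
    return mod_spec, t_mod, s_mod
```

With the frequency axis expressed in kilohertz, harmonics spaced every 200 Hz show up near a spectral modulation of 5 cycles/kHz, while the syllabic energy arcs of speech show up as temporal modulations of a few hertz.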
Figure 4.1
Modulation
spectra of spoken English. Spectrograms of spoken sentences (example sentence “The radio was
playing too loudly,” shown in A) are subjected to a two-dimensional Fourier
transform (2DFT) to calculate the sentence’s modulation spectrum (C). Just as
an ordinary Fourier transform represents a waveform as a superposition of sine
waves, a 2DFT represents a spectrogram as a superposition of “spectrotemporal ripples.” The spectrograms of the ripples
themselves look like zebra stripes (B). Their temporal modulation captures the
sound’s AM, and their spectral modulation captures spectral features such as
harmonics and formants.
Adapted from figure 1 of Elliott and Theunissen (2009), PLoS Comput Biol
5:e1000302.
But we cannot describe every
aspect of a spectrogram solely in terms of horizontal stripes that correspond
to particular spectral modulations. There is also variation in time.
Particularly obvious examples are the short, sharp, broadband fricative and
plosive consonants which show up as vertical stripes in the spectrogram in figure
4.1A. In the modulation spectrum of a sound, these and
other “vertical” features are captured by “temporal modulations,” and this
occurs in a relatively intuitive manner. Temporal modulations simply measure
the amount of AM at some particular modulation rate, so high-frequency temporal
modulations capture fast changes in amplitude, low temporal modulation
frequencies capture slow changes in amplitude, and at 0 Hz temporal modulation
we find the sound’s grand average (constant) signal power.
Earlier, we mentioned that,
according to Greenberg (2006), speech sounds are characterized by syllabic energy
arcs, which are between 100 and 500 ms wide. These energy arcs should correspond to temporal modulation frequencies of roughly 2 to 10 Hz, since one arc every 500 ms amounts to a 2-Hz modulation and one every 100 ms to a 10-Hz modulation. The fact
that almost all the signal power in the speech modulation spectrum shown in figure
4.1C appears to be contained between –10 and +10 Hz temporal modulation is therefore compatible with Greenberg’s
observations.
Hang on a minute. Did I just say a
temporal modulation of minus 10 Hz?
It is relatively easy to imagine what a temporal modulation of 10 Hz might
represent: Some property of the sound gets larger and then smaller and then
larger again ten times a second. But what, you may ask, is a temporal
modulation rate of minus 10 Hz
supposed to mean? You would be right to think that, in our universe, where time
never flows backward, a sound can hardly go through some cyclical changes once
every minus 0.1 s. Indeed, these “negative” temporal modulation frequencies
should simply be thought of as an expedient mathematical trick that allows us
to represent acoustic features in which frequency changes over time, and which
would show up as diagonal stripes in the spectrogram. The 2DFT captures such spectrotemporal modulations with diagonal ripples, and
these diagonals come in two flavors: They either rise, or they fall. In the
convention adopted here, a spectrotemporal ripple
with a negative temporal modulation corresponds to rising frequency
trajectories, while positive temporal frequencies correspond to falling
frequencies. The fact that the modulation spectrum shown in figure
4.1C is fairly symmetrical around 0 Hz
temporal modulation, thus, tells us that, in the sample of American English
sentences analyzed here, features with rising frequency content are just as
common as features with falling FM.
Figure 4.2
Modulation
spectra of male and female English speech.
From figure 2 of Elliott and
Theunissen (2009), PLoS Comput Biol 5:e1000302.
One very useful feature of the
modulation spectrum is that it separates out low spectral frequency, that is,
spectrally broad features such as the formants of speech, from high spectral
frequency features, such as harmonic fine structure associated with pitch. The
pitch of female speech tends to be noticeably higher than that of male speech,
but the formants of female speech differ less from those of male speech, which
presumably makes understanding speech, regardless of speaker gender,
substantially easier. This is readily apparent in the modulation spectra shown
in figure 4.2,
which are rather similar for the low spectral modulations (less than 3 cycles/kHz)
associated with the broad formant filters, but much more dissimilar for the
higher spectral modulations associated with pitch.
Given the crucial role of formants
in speech, we might therefore expect the meaning of the spoken sentence to be
carried mostly in the relatively low temporal and frequency modulations, and
one rather nice feature of Elliott and Theunissen’s
(2009) use of the modulation spectrum is that they were able to demonstrate
this directly. Modulation spectra, like ordinary Fourier transforms, are in
principle “invertible,” that is, you can use the spectrum to reconstruct the
original signal. Of course, if you blank out or modify parts of the modulation
spectrum before inversion, then you remove or modify the corresponding temporal
or spectral modulations from the original speech.
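To illustrate the idea of filtering in the modulation domain, here is a minimal sketch that continues the previous one: it zeroes out everything outside a chosen range of temporal and spectral modulations and then inverts the two-dimensional transform. This is a toy low-pass version of the idea only; Elliott and Theunissen’s (2009) actual procedure, including the reconstruction of an audible waveform from the filtered spectrogram, is considerably more careful, and the names below are ours.

```python
import numpy as np

def modulation_filter(log_spec, t_mod, s_mod,
                      t_keep_hz=7.0, s_keep_cyc_per_khz=4.0):
    """Keep only 'core' modulations (|temporal| <= ~7 Hz,
    |spectral| <= ~4 cycles/kHz) of a log-magnitude spectrogram.
    Toy low-pass version of modulation filtering; waveform
    reconstruction is omitted here."""
    M = np.fft.fftshift(np.fft.fft2(log_spec))
    keep = (np.abs(s_mod)[:, None] <= s_keep_cyc_per_khz) & \
           (np.abs(t_mod)[None, :] <= t_keep_hz)
    M_filtered = np.where(keep, M, 0.0)
    # Invert the 2D transform; result is a modulation-filtered spectrogram
    return np.real(np.fft.ifft2(np.fft.ifftshift(M_filtered)))
```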
By testing comprehension of
sentences after filtering out various ranges of modulations, Elliott and Theunissen (2009) were able to demonstrate that spectral
modulations of less than 4 cycles/kHz and temporal modulations between 1 and 7
Hz are critical for speech intelligibility. They refer to this as the core region
of the human speech modulation spectrum, and it sits very much in those parts
of modulation space where we would expect the formants of speech to reside.
Filtering out modulations outside this core region has only a relatively small
effect on speech intelligibility, but may make it much harder to distinguish
male from female speakers, particularly if it affects spectral modulations
between 4 and 12 cycles/kHz, which, as we have seen in
figure 4.2,
encapsulate much of the differences in male versus female voice pitch. Examples
of such “modulation filtered” speech can be found in the online material accompanying
Elliott and Theunissen’s original (2009) paper, or on
the website accompanying this book.
Many artificial speech recognition
systems use a mathematical device that is conceptually closely related to the
modulation spectrum, known as the “dynamic cepstrum.”
Like the modulation spectrum, the cepstrum is
calculated as a Fourier transform along the frequency axis of the sound’s
spectrogram. The cepstrum can then be used to
separate out the low spectral modulations that are associated with the
relatively broadly tuned resonances of the vocal tract that mark the formants
of speech, and discard the high spectral modulations associated with pitch.
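The cepstral idea can likewise be sketched in a few lines. The toy function below transforms a single log-spectrum frame along its frequency axis and keeps only the first few coefficients, which retain the broad formant envelope while discarding the fine harmonic structure associated with pitch. Real recognizers typically compute mel-frequency cepstral coefficients rather than this simplified linear-frequency version, and the function name and parameters here are ours.

```python
import numpy as np

def cepstral_envelope(log_spec_frame, n_keep=13):
    """Very simplified cepstral smoothing of one log-spectrum frame:
    keep only low 'quefrency' coefficients, which carry the broad
    formant envelope, and discard the fine pitch-related structure."""
    cepstrum = np.fft.irfft(log_spec_frame)   # frequency -> "quefrency"
    cepstrum[n_keep:-n_keep] = 0.0            # discard high quefrencies
    return np.fft.rfft(cepstrum).real         # smoothed log spectrum
```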
Pitch, thus, seems to add little
to speech comprehension, at least for English and most Indo-European languages.
(You can find examples of speech samples with altered pitch contours which
illustrate this on the book’s web site.) But it is worth
noting that many Asian languages are tonal, meaning that they may use pitch
trajectories to distinguish the meanings of different words. To give one
example: Translated into Mandarin Chinese, the sentence “mother curses the horse” becomes “māma mà mǎ.” The symbols above the “a”s are Pinyin tone markers, intended to indicate the required pitch. Thus, the “ā” is pronounced not unlike the “a” in the English “bark,” but with a pitch much higher than the speaker’s normal, neutral speaking voice. The “à,” in contrast, must be pronounced with a rapidly falling pitch contour, and in the “ǎ”—particularly challenging for unaccustomed Western vocal
tracts—the pitch must first dip down low, and then rise again sharply, by well
over one octave, in a small fraction of a second. Remove these pitch cues, for
example, by filtering out spectral modulations above 7 cycles/kHz,
and the sentence becomes “mamamama,” which is gibberish in both Chinese and English.
One consequence of this use of pitch to carry semantic meaning in tonal
languages is that many current Western speech processing technologies, from
speech recognition software for personal computers to speech processors for
cochlear implants, are not well adapted to the needs of approximately one
quarter of the world’s population.
Nevertheless, even in tonal
languages, the lion’s share of the meaning of speech appears to be carried in
the formants, and in particular in how formant patterns change (how they are modulated)
over time. Relatively less meaning is encoded in pitch, but that does not mean
that pitch is unimportant. Even in Western, nontonal
languages, pitch may provide a valuable cue that helps separate a speech signal
out from background noise, or to distinguish one speaker from another. (We
shall return to the role of pitch as a cue in such auditory scene analysis
problems in chapter 6.) In chapter 3, we looked in detail at how pitch
information is thought to be represented in the auditory pathway, so let us now
look in more detail at how the auditory system distinguishes different classes
of speech sounds.
One crucial, but also
poorly understood, part of the neural processing of
speech is that sounds must be mapped onto categories. The physical properties
of a sound can vary smoothly and continuously. Not only can we produce /a/ sounds
and /i/ sounds, but also all manner of vowels that
lie somewhere along the continuum between /a/ and /i/.
However, perceiving a sound as “between /a/ and /i/”
is unhelpful for the purpose of understanding speech. Our brains have to make categorical
decisions. A person who is talking to us may be telling us about his “bun” or
his “bin,” but it has to be one or the other. There is no continuum of objects
between “bun” and “bin.” Once we use speech sounds to distinguish different,
discrete objects, concepts, or grammatical constructs, we must subdivide the
continuous space of possible speech sounds into discrete categories. Such categorical
perception is therefore believed to be a key step in speech processing, and it
has attracted much interest among researchers. What are the criteria that our
brains use to distinguish sound categories? Are category boundaries arbitrary
parcellations of the set of all possible speech sounds, and do different
languages draw these boundaries differently? Or are there physical or
physiological laws that dictate where the boundaries should fall? And does the
human brain comprise specialized modules for recognizing such phoneme
categories that other animals lack, or is categorical
perception of speech sounds or vocalizations also seen in other animals?
Let us first consider the question
of phoneme boundaries. One thing that is readily apparent to most students of
foreign languages is that the phoneme boundaries are not the same in all languages.
German, for example, has its umlauts—“ä,” “ü,” and “ö”—effectively a set of
additional vowels that are lacking in English. Some Scandinavian languages have
even more vowels, such as the “å” and the “ø.” Thus, English lacks certain phoneme
categories that exist in other languages, but it also makes some distinctions other
languages do not. Japanese speakers, for example, are famously unable to distinguish
between “r” and “l” sounds. This inability is not innate, however, but emerges
during the first year of life, as children are conditioned in their mother
tongue (Kuhl et al., 2006). While these language-specific
differences suggest that phonetic boundaries are largely determined by the
environment we grow up in, they may nevertheless not be entirely arbitrary. In
a recent review, Diehl (2008) discussed two theories, the quantal
theory and the dispersion theory, which may help explain why phonetic category
boundaries are where they are. Both theories emerge from considerations of the
physical properties and limitations of the vocal tract.
To get a feeling for the ideas
behind quantal theory, let us start with a simple
experiment that you can try yourself. Make a long “ssssssss”
sound, and then, while keeping the sound going, move the tip of your tongue
very slowly backward in your mouth so as to gradually change the sound from /s/
to /sh/. When the tip of your tongue is near your
teeth, you will make an /s/, and with the tip of your
tongue further back against your palate you will make a /sh/.
So far so good, but you may notice that placing your tongue halfway between the
/s/ and /sh/ positions does not easily produce a
sound halfway between /s/ and /sh/. As you move your
tongue steadily forward or backward you may notice a very sudden transition (a “quantum
leap”) from /s/ to /sh/ or from /sh/
to /s/. Quantal theory posits that languages avoid
using speech sounds that are very close to such quantal
boundaries in the acoustics. We do not use speech sounds somewhere between /s/
and /sh/ because they would be too difficult to
pronounce reliably. Near a quantal boundary, a small
inaccuracy in the placement of the articulators will often lead to
disproportionately large changes in the sound produced, and therefore any
category of speech sounds that happened to live very close to a quantal boundary would be particularly easy to mispronounce
and mishear. Making sure that speech sound categories keep a respectful distance
from quantal boundaries would therefore make speech
more robust.
Dispersion theory (Liljencrants & Lindblom, 1972)
takes a different approach. It starts from the realization that there are
limits to the range of speech sounds a normal human vocal tract can produce.
The theory then makes the not unreasonable assumption that, to make speech
sound categories easily distinguishable, they should be widely spread out
(dispersed) across this space of all possible speech sounds. Figure
4.3 illustrates this for the case of vowel sounds. The
coordinate axes of figure 4.3
show the first and second formant frequency of a particular vowel. The
continuous contour shows the limits of the formant frequencies that a normal
human vocal tract can easily produce, while the dots show first and second
formant frequencies for some of the major vowel categories of the world’s
languages. Indeed, it looks as if the vowels are not positioned randomly
inside the space available within the contour, but instead are placed so as to
maximize their distances, which should help make them easily distinguishable.
Figure 4.3
The contour shows the
pairings of possible F1/F2 formant frequencies, which are as distinct as they
could be, given the physical constraints of the human vocal tract. Symbols show
approximately where vowels of the world’s languages are located in this F1/F2
space.
Adapted from figure 7 of Diehl (2008)
Phil Trans Royal Soc B: Biological Sciences 363:965, with permission from the
Royal Society.
Thus, the quantal
and dispersion theories establish some ground rules for where phonetic category
boundaries might fall, but there is nevertheless considerable scope for different
languages to draw up these boundaries differently, as we have seen. And this
means that at least some phoneme boundaries cannot be innate, but must be
learned, typically early in life. Interestingly, the learning of some of these
categorical boundaries appears to be subject to so-called critical or sensitive
developmental periods (Kuhl et al., 2008), so that
category distinctions that are learned in the early years of life are very hard
or impossible to unlearn in adulthood. Thus, Japanese speakers who have not
learned to differentiate between “r” and “l” early in life appear to find it
very difficult even to hear a difference between these sounds in adulthood. (We
will say more about sensitive periods in chapter 7.) However, that seems to be
an extreme case, and most learned phoneme boundaries do not bring with them an
inability to distinguish sounds that fall within a learned class. Thus, the
English language has no category boundary that distinguishes the vowels /ä/ or
/å/, yet adult native English speakers can learn very quickly to distinguish
and to recognize them.
So some phonetic category
boundaries, such as those between /r/ and /l/ or between /a/, /ä/, and /å/, are
therefore largely language specific and culturally determined, and children
pick them up in early infancy. But other phoneme boundaries seem fixed across many
languages, and may therefore be based on distinctions that are hard-wired into
the auditory system of all humans, or perhaps all mammals. For example, you may
recall that consonants differ either in their place or their manner of
articulation. Thus, /p/ and /t/ are distinguished by place of articulation (one
is made with the lips, the other with the tip of the tongue placed against the
top row of teeth), but /b/ and /p/ both have the same “labial” place of
articulation. What distinguishes /b/ from /p/ is that one is said to have a
“voiced” manner of articulation while the other is “unvoiced,” in other words
the /p/ in “pad” and the /b/ in “bad” differ mostly in their “voice onset time”
(VOT). In “pad,” there is a little gap of about 70 ms between the plosive /p/ sound and the onset of the vocal fold vibration that marks the vowel /a/, while in “bad,” vocal fold vibration starts almost
immediately after the /b/, with a gap typically no greater than 20 ms or so.
Consequently, it is possible to morph a recording of the word “bad” to sound
like “pad” simply by lengthening the gap between the /b/ and the /a/. What is
curious about VOTs is that their length does not seem to vary arbitrarily from one language to the next. Instead, VOTs fall into no more than three distinct classes, referred to as leading, short, or long, which are conserved across the
world’s languages. In the leading category, voicing may start at about 100 ms
before the consonant, while short VOTs imply that
voicing starts 10 to 20 ms after the consonant, and long VOTs
mean that voicing starts about 70 ms after the consonant (Lisker
& Abramson, 1964). Leading voicing is not a typical feature of English, but
it is common in other languages such as Spanish, where, for example, “v” is
pronounced like a very soft /b/, so that the word “victoria” is pronounced “mbictoria.”
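The “morphing” of “bad” toward “pad” mentioned above amounts to little more than inserting a stretch of silence between the release burst and the onset of voicing. The sketch below illustrates the idea; the sample index marking the end of the burst is a hypothetical placeholder that would have to be located in a real recording, and the function name is ours.

```python
import numpy as np

def lengthen_vot(wave, fs, burst_end_sample, extra_gap_ms=50.0):
    """Insert silence after the consonant's release burst to lengthen
    the voice onset time (e.g., to push "bad" toward "pad").
    burst_end_sample must be found by hand or by an onset detector."""
    gap = np.zeros(int(fs * extra_gap_ms / 1000.0), dtype=wave.dtype)
    return np.concatenate([wave[:burst_end_sample], gap,
                           wave[burst_end_sample:]])
```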
Several studies have shown that
animals, such as chinchillas (Kuhl & Miller 1978)
or quail (Kluender, Diehl, & Killeen, 1987), can be easily trained to discriminate stop consonants
with short or long VOTs. Thus the three VOT
categories may be “linguistic universals” because they are based on acoustic
distinctions that are particularly salient for the auditory systems not just of
humans but also of other animals.
You may recall from section
2.4, and in particular from figure 2.13, that the
inner ear and auditory nerve (AN) are thought to operate like a filter bank,
and firing rates along the tonotopic array of the
auditory nerve fiber bundle create a sort of “neurogram,”
a rate-place code for the acoustic energy distribution in the incoming sound.
The frequency resolution in that tonotopic rate-place
code is not terribly sharp, but it is easily sharp enough to capture formant
peaks. When we introduced the neurogram notion in figure
2.13, however, we did gloss over a small complication, which we now ought to
come clean about. We mentioned only in passing in section 2.4 that the majority
(roughly 80%) of AN fibers are high spontaneous rate
fibers that saturate—that is, they cannot fire any faster, once the sound at
their preferred frequency reaches a level of between 30 and 50 dB SPL. Over 30
years ago, Young and Sachs (1979) had already pointed out that this rate
saturation can have awkward consequences for the place-rate representation of
formants in the auditory nerve. Figure 4.4
illustrates some of their findings from a set of experiments in which they
recorded AN responses to artificial vowel sounds presented
at different sound levels.
Figure 4.4
Responses
in the auditory nerve of the cat to a steady-state artificial vowel /I/. (A) Power
spectrum of the vowel sound. It exhibits harmonics every 128 Hz, and formants
at approximately 0.4, 2, and 2.8 kHz. (B) Nerve fiber responses when the vowel
is presented at a relatively quiet 34 dB SPL. Each dot is the normalized evoked
firing rate of a single nerve fiber, plotted against each fiber’s CF. The
continuous line is a moving average along the frequency axis. The observed
nerve discharge rate distribution exhibits clear peaks near the stimulus
formant frequencies. (C) Nerve fiber responses when the sound is presented at a
moderately loud 64 dB SPL. Due to saturation of nerve fiber responses, the
peaks in the firing distributions are no longer clear. Based
on data published in Young and Sachs (1979).
Figure 4.4A shows the power spectrum of the
stimulus Sachs and Young used: an artificial vowel with harmonics every 128 Hz,
passed through a set of formant filters to impose formant peaks at about 400 Hz,
as well as at about 2,000 and 2,800 Hz. The resultant sound is not too different
from the human vowel /I/ (a bit like the “i” in
“blitz”). Note that the figure uses a logarithmic frequency axis, which
explains why the harmonics, spaced at regular 128-Hz intervals, appear to
become more densely packed at higher frequencies. Sachs and Young recorded
responses from many AN fibers in the anesthetized cat
to presentations of this sound at various sound levels. Figure
4.4B summarizes their
results for presentations of the artificial vowel at an
intensity of 34 dB SPL. Each point in figure
4.4B shows the response for a single nerve fiber.
The x-coordinate shows the nerve fiber’s characteristic frequency (CF), and the
y-coordinate shows the nerve fiber’s firing rate, averaged over repeated
presentations of the stimulus at 34 dB SPL, and normalized by subtracting the
nerve fiber’s spontaneous firing rate and dividing by the nerve fiber’s maximal
firing rate in response to loud pure tones at its CF. In other words, a normalized
rate of 0 means the neuron fires no more strongly than it would in complete
quiet, while a normalized rate of 1 means it fires almost as strongly as it
ever will. The gray continuous line in figure
4.4B shows a moving average across the observed
normalized firing rates for the AN fibers. This
averaged normalized firing rate as a function of CF appears to capture the
formant peaks in the stimulus quite nicely. So what’s the problem?
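For concreteness, the normalization described above can be written out as a one-line function (a sketch with our own variable names, following the description in the text rather than the exact procedure of the original paper):

```python
def normalized_rate(driven_rate, spontaneous_rate, max_tone_rate):
    """Normalization as described in the text: subtract the fiber's
    spontaneous rate and divide by its maximal rate to loud CF tones,
    so ~0 means no response above baseline and ~1 means near-maximal firing."""
    return (driven_rate - spontaneous_rate) / max_tone_rate
```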
The problem is that 34 dB SPL is
really very quiet. Just the background hum generated by many air conditioning
systems or distant traffic noise will have higher sound levels. Most people
converse with speech sounds of an intensity closer to 65 to 70 dB SPL. But when
Sachs and Young repeated their experiment with the artificial vowel presented
at the more “natural” sound level of 64 dB SPL, the firing rate distribution in
the auditory nerve was nothing like as pretty, as can be seen in figure 4.4C. The firing rate
distribution has become a lot flatter, and the formant peaks are no longer
readily apparent. The problem is not so much that the peaks in the firing rate
distribution at the formants have disappeared, but rather that the valley
between them has filled in. This is a classic example of the so-called dynamic
range problem. Most AN fibers saturate at relatively
modest sound intensities, and sounds don’t have to become very loud before the
AN fibers lose their ability to signal spectral contrasts like those between
the formant peaks and troughs in a vowel.
The curious thing, of course, is
that even though the representation of the formant peaks across the nerve fiber
array appears to become degraded as sound levels increase, speech sounds do not
become harder to understand with increasing loudness—if anything the opposite
is true. So what is going on?
There are a number of possible
solutions to the dynamic range problem. For example, you may recall from
chapter 2 that AN fibers come in two classes: High
spontaneous rate (HSR) fibers are very sensitive and therefore able to respond
to very quiet sounds, but they also saturate quickly; low spontaneous rate
(LSR) fibers are less sensitive, but they do not saturate nearly as easily.
The plots in figure 4.4
do not distinguish between these classes of AN fibers.
Perhaps the auditory pathway uses HSR fibers only for hearing in very quiet
environments. HSR fibers outnumber LSR fibers about four to one, so the large
majority of the nerve fibers sampled in figure
4.4 are likely to be HSR fibers, and using those to
encode a vowel at 64 dB might be a bit like trying to use night vision goggles
to see in bright daylight.
Young and Sachs (1979) also
proposed an alternative, perhaps better explanation when they noticed that,
even though the nerve fibers with CFs between 300 and
3,000 Hz shown in figure 4.4C may all fire at similarly high
discharge rates, they tend to phase lock to the formant frequencies close to
their own CF. For example, you might find a 500-Hz fiber that is firing
vigorously, but at an underlying 400-Hz rhythm. In that case you might conclude
that the dominant frequency component in that frequency range is the frequency
signaled by the temporal firing pattern (400 Hz) even if this is not the nerve
fiber’s preferred frequency (500 Hz). Such considerations led Young and Sachs
(1979) to propose a response measure that takes both firing rate and phase
locking into account. This response measure, the “average localized synchronized
rate” (ALSR), quantifies the rate of spikes that are locked to the CF of the
neuron. In the previous example, the ALSR would be rather low for the 500-Hz
neuron, since most spikes are synchronized to the 400-Hz formant. The ALSR
measure of auditory nerve discharges reflects formant frequencies much more
stably than ordinary nonsynchronized rate-place codes
could.
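In the spirit of Young and Sachs (1979), though not following their exact procedure, an ALSR-like measure could be computed roughly as follows: take each fiber’s peristimulus time histogram and measure the size of its Fourier component at the stimulus harmonic closest to the fiber’s CF. All names and parameters in this sketch are ours.

```python
import numpy as np

def alsr_like(psth, bin_s, cf_hz, f0_hz=128.0):
    """Rough ALSR-style measure: magnitude of the Fourier component of a
    fiber's response histogram at the stimulus harmonic nearest its CF."""
    n = len(psth)
    freqs = np.fft.rfftfreq(n, d=bin_s)
    spectrum = np.abs(np.fft.rfft(psth - np.mean(psth))) / n
    # Find the stimulus harmonic closest to the fiber's CF ...
    target = f0_hz * max(1, round(cf_hz / f0_hz))
    # ... and report the synchronized response at that frequency.
    return 2 * spectrum[np.argmin(np.abs(freqs - target))]
```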
Whether your auditory brainstem
solves the dynamic range problem by computing the ALSR, by listening
selectively either to HSR or to LSR fibers depending on sound level, or by
relying on some other as yet unidentified mechanism is not known. However, we
can be pretty certain that your auditory brainstem does solve this problem, not
only because your ability to understand speech tends to be robust over wide
sound level ranges, but also because electrophysiological recordings have shown
that so-called chopper neurons in the ventral cochlear nucleus can represent
formants in a much more sound level invariant manner than the auditory nerve
fibers do (Blackburn & Sachs, 1990).
At the level of the auditory
brainstem, as well as in the major midbrain and thalamic nuclei, such as the inferior colliculus and the medial geniculate,
and to some extent even in primary auditory cortex, this representation is
thought to retain a somewhat spectrographic character. The pattern of neural
discharges mostly reflects the waxing and waning of acoustic energy in the
particular frequency bands to which these neurons happen to be tuned. But much
experimental evidence suggests that this representation is not very isomorphic,
in the sense that neural firing patterns in the brainstem and midbrain do not
simply and directly reflect the rhythms of speech. Nor do the temporal response properties of neurons in the thalamus or cortex appear to be tuned to match the temporal properties of speech particularly well. Evidence for this comes from studies like those by Miller and colleagues (2002), who analyzed the response properties of thalamic and cortical neurons using synthetic dynamic ripple
stimuli and reverse correlation. The dynamic ripple sounds they used are
synthetic random chords that vary constantly and randomly in their frequency and
amplitude. The rationale behind these experiments rests on the assumption that,
at some periods, just by chance, this ever changing stimulus will contain
features that excite a particular neuron, while at other times it will not. So,
if one presents a sufficiently long random stimulus, and then asks what all
those stimulus episodes that caused a particular neuron to fire had in common,
one can characterize the neuron’s response preferences. Often this is done by a
(more or less) simple averaging of the spectrogram of the stimulus episodes
that preceded a spike, and the resulting spike-triggered average of the
stimulus serves as an estimate of the neuron’s spectrotemporal
receptive field (STRF).
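The spike-triggered averaging step itself is simple enough to sketch in a few lines of Python. The version below assumes a precomputed stimulus spectrogram and spike times already converted to time-bin indices; it illustrates the general technique, not the specific analysis pipeline of Miller and colleagues (2002), and all names are ours.

```python
import numpy as np

def strf_by_spike_triggered_average(spectrogram, spike_bins, n_history_bins=30):
    """Estimate an STRF by averaging the spectrogram slices that preceded
    each spike. spectrogram: (n_freqs, n_timebins); spike_bins: time-bin
    indices of spikes. Returns an array (n_freqs, n_history_bins)."""
    n_freqs, n_bins = spectrogram.shape
    sta = np.zeros((n_freqs, n_history_bins))
    count = 0
    for t in spike_bins:
        if t >= n_history_bins:                 # need a full history window
            sta += spectrogram[:, t - n_history_bins:t]
            count += 1
    sta /= max(count, 1)
    # Subtracting the mean spectrogram level yields the STRF estimate
    return sta - spectrogram.mean(axis=1, keepdims=True)
```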
You may recall from our
discussions in chapter 2 that auditory neurons can be modeled, to a coarse
approximation, as linear filters. For example, in figure 2.12, we illustrated
similarities between auditory nerve fibers and so-called gamma-tone filters.
Now, in principle, we can also try to approximate auditory neurons in the
central nervous system with linear filter models (although the approximation risks becoming ever cruder, as each level of neural processing may contribute additional nonlinearities to the neural response properties). The way to
think of a neuron’s STRF is as a sort of spectrographic display of the linear
filter that would best approximate the neuron’s response properties.
Consequently, we would expect a neuron to fire vigorously only if there is a
good match between the features of the STRF and the spectrogram of the
presented sound.
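In this linear-filter picture, the predicted response is obtained by sliding the STRF along the stimulus spectrogram and summing the pointwise product at each time step, as in the following sketch (our own illustrative code, with a simple rectification to keep predicted rates nonnegative):

```python
import numpy as np

def predict_rate(strf, spectrogram):
    """Linear prediction of a neuron's firing rate: at each time bin,
    correlate the STRF with the preceding stretch of the spectrogram.
    strf: (n_freqs, n_history_bins); spectrogram: (n_freqs, n_timebins)."""
    n_freqs, n_hist = strf.shape
    n_bins = spectrogram.shape[1]
    rate = np.zeros(n_bins)
    for t in range(n_hist, n_bins):
        rate[t] = np.sum(strf * spectrogram[:, t - n_hist:t])
    return np.maximum(rate, 0.0)   # firing rates cannot be negative
```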
Figure 4.5A shows such an example of an STRF
estimated for one neuron recorded in the auditory thalamus of the cat. The STRF
shows that this particular neuron is excited by sound at about 9 kHz and the
excitation kicks in with a latency of about 10 ms.
There are inhibitory frequency regions both above and below the neuron’s
preferred frequency. But we also see that the excitatory region near 9 kHz is followed
by “rebound inhibition,” which should shut off this neuron’s firing after about
30 ms of response. The STRF would thus predict that continuous, steady-state
sounds are not particularly effective stimuli for this neuron.
In section 4.1, we discussed the
dynamic properties of speech, and saw how its characteristic amplitude and
frequency modulations can be captured by its modulation spectrum. We also saw how
the modulation spectrum is generated from the sound’s spectrogram by two-dimensional
Fourier transformation. Now, if the STRF of a neuron is a sort of
spectrographic display of the spectral and temporal features that a neuron
prefers, we ought to be able to apply a similar thought process here, and
transform the neuron’s STRF with a 2DFT to reveal what sort of spectral and
temporal modulations the neuron might respond to with particular vigor. To use
the technical term, we can use the 2DFT of the STRF to obtain the neuron’s “modulation
transfer function” (MTF). Figure 4.5B
shows the MTF obtained in this manner from the STRF shown in figure 4.5A.
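Computationally, obtaining an MTF from an STRF mirrors obtaining a modulation spectrum from a spectrogram, as in the earlier sketch; in code it can be as simple as the following (again our own illustration):

```python
import numpy as np

def mtf_from_strf(strf):
    """Modulation transfer function as the magnitude of the STRF's 2D
    Fourier transform, analogous to computing a sound's modulation
    spectrum from its spectrogram."""
    return np.abs(np.fft.fftshift(np.fft.fft2(strf)))
```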
Figure 4.5
(A) Spectrotemporal receptive field of a neuron recorded in the
auditory thalamus of the cat. BW: spectral band width, BF: best frequency, Lat:
response latency. (B) Modulation transfer function for the same neuron. BSM: best spectral modulation; BTM: best temporal modulation.
Reproduced from figure 1 of Miller et
al. (2002), with permission from the American Physiological Society.
When many neurons are
characterized in this manner, it is possible to investigate what range of
temporal and spectral modulations the population of neurons, on average, might
be able to represent effectively through equivalent modulations of their own
firing rates. We can then ask, is this population MTF well
matched to the modulations encountered in speech? Figure
4.6 shows population MTFs
determined in this manner for the main auditory thalamic relay station to the
cortex, the ventral part of the medial geniculate (MGv), and for the primary auditory cortex of the cat.
Figure 4.6
(A) Population MTF of the cat auditory thalamus (MGv). The dots show best
temporal and spectral modulation values for individual neurons. The gray values
show the averaged MTF. (B) Population MTF for cat A1.
Reproduced from figure
8 of Miller et al. (2002), with permission from the American Physiological
Society.
Comparing the population MTFs of auditory thalamus and cortex shown in figure 4.6 to the speech modulation
spectra shown in figures 4.1C and 4.2,
we notice that the neural MTFs at these, still
relatively early, auditory processing stations are not obviously well matched
to the temporal modulations characteristic of speech. However, it would
probably have been surprising if they were. For starters, the MTFs shown in figure
4.6 were recorded in cats, and the cat auditory system is hardly likely to be optimized for the processing of human speech. However, we
have no reason to assume that the population MTFs of
human thalamus or primary auditory cortex would look much different. The shape
of the population MTF can provide some insights into the nature of the neural code for sounds at these processing stations that are likely to hold true for most mammals.
The best temporal modulation frequencies for these neurons are often higher
than they would need to be for the purposes of speech encoding (extending out
to 50 Hz and above while speech modulations rarely exceed 30 Hz). In contrast,
the best spectral modulations are not nearly high enough to capture the pitch part of the speech modulation spectrum, but they do appear quite well matched to the spectral modulations associated with the formants. We saw in chapters 2 and 3
that, already at the level of the auditory nerve, frequency tuning of individual
nerve fibers is not usually sharp enough to resolve the harmonics that would
reveal the pitch of a periodic sound, and the low-pass nature of the population
MTF suggests that neurons at these stations cannot resolve harmonics in the sound’s
spectrum either.
Another striking mismatch between
the speech modulation spectrum and the neural population MTF can be seen at
temporal modulation frequencies near zero. Speech modulation spectra have a lot
of energy near 0-Hz temporal modulation, which reflects the fact that, on
average, there is some sound almost all the time during speech. However, as one
ascends the auditory pathway toward cortex, auditory neurons appear
increasingly unwilling to respond to sustained sound with sustained firing.
Instead, they prefer to mark sound onsets and offsets with transient bursts of
firing, and then fall silent. Slow temporal modulations of their firing rates
are consequently rare. With this emphasis on change in the spectrum, rather
than sustained sound energy levels, the representation of sounds in the
midbrain and above resembles the derivative of the spectrogram with respect to
time, but since speech sounds are rarely sustained for long periods of time,
this emphasis on time-varying features does not change the representation
dramatically. In fact, as we shall see later, even at the level of the primary auditory cortex, speech sounds appear to be represented in a manner that is
perhaps more spectrographic (or neurographic) than
one might expect. Consequently, much interesting processing of speech remains
to be done when the sounds arrive at higher-order cortical fields. Where and
how this processing is thought to take place will occupy us for much of the
rest of this chapter.
Before we start our
discussion of the cortical processing of speech sounds in earnest, let us
quickly revise some of the key anatomical terms, so we know which bit is which.
Figure 4.7 shows anatomical drawings of the human
cerebral cortex. Figure 4.7A
shows the left side of the cortex, reminding you that the cortex on each
hemisphere is subdivided into four lobes: The occipital lobe at the back deals
mostly with visual processing; the parietal lobe deals with touch, but also
integrates information across sensory modalities to keep track of the body’s
position relative to objects and events around us; the frontal lobe is involved
in planning and coordinating movements, short-term working memory, and other cognitive
functions; and, finally, the temporal lobe is involved in hearing, but also in
high-level vision, general object recognition, and the formation of long-term
memories.
In the human brain, many of the
auditory structures of the temporal lobe are not visible on the surface, but
are tucked away into the sylvian fissure, which forms
the boundary between the temporal, frontal and parietal lobes. Figure
4.7B
therefore shows a view of the human brain from above, with the left frontal and
parietal lobes cut away to show the upper bank of the temporal lobe. Most of
the auditory afferents from the thalamus terminate in Heschl’s
gyrus, where primary auditory cortex is found in
humans. To the front and back of Heschl’s gyrus lie the planum polare and the planum temporale, where second-order auditory belt areas are
situated.
Figure 4.7
(A) Drawing of a
lateral view of the human brain, showing the four principal lobes. (B) Human
cortex seen from above, with the frontal and parietal lobe cut away to expose
the superior bank of the temporal lobe, where the primary auditory cortex (Heschl’s gyrus) and auditory belt
areas (planum temporale and
polare) are situated. (C) Lateral view showing higher-order
cortical areas commonly associated with higher-order auditory processing and
speech.
Original artwork kindly provided by
Jo Emmons (www.joemmons.com).
But the processing of speech
certainly also involves cortical areas well beyond the auditory areas on the
upper bank of the temporal lobe. A few of the other key areas are shown in figure 4.7C,
including the superior temporal gyrus (STG) and sulcus (STS), and Broca’s and Wernicke’s areas. The latter two are both named after nineteenth-century
neurologists who associated damage to these areas with disturbances of either
speech production (Broca’s aphasia) or speech
comprehension (Wernicke’s aphasia). (See the book’s website for short video clips showing patients with Broca’s and Wernicke’s aphasia.) Note that the definitions of
Broca’s and Wernicke’s
areas are not based on anatomical landmarks, but instead derived from case
studies of patients with injuries to these parts of the brain. Since the damage
caused by such injuries is rarely confined to precisely circumscribed regions, the
exact boundaries of Broca’s and Wernicke’s
areas are somewhat uncertain, although a consensus seems to be emerging among neuroanatomists that Broca’s area
should be considered equivalent to the cytoarchitectonically
defined and well-circumscribed Brodmann areas 44 and
45. In any case, both Wernicke’s and Broca’s areas clearly lie either largely or entirely
outside the temporal lobe, which is traditionally associated with auditory
processing. Note also that, while both hemispheres of the cortex have frontal,
parietal, occipital, and temporal lobes, as well as Heschl’s
gyri and superior temporal sulci,
much clinical evidence points to the left hemisphere playing a special role in
speech processing, and Broca’s and Wernicke’s areas appear normally to be confined largely or
wholly to the left hemisphere.
Paul Broca, who lived from 1824 to 1880, and in whose honor one
of the brain areas we just encountered is named, was one of the first to
observe that speech processing may be asymmetrically distributed in cortex. He
stated that the left hemisphere was “dominant” for language. Broca chose his words carefully. The left hemisphere’s
dominance is not meant to imply that the right hemisphere contributes nothing
to speech comprehension or production. Rather, the left hemisphere, in most but
not all individuals, is capable of carrying out certain key speech processing
functions even if the right hemisphere is not available to help, but the right
hemisphere on its own would not succeed. We might envisage the situation as similar
to that of a lumberjack who, if forced to work with one hand only, would be
able to wield his axe with his right hand, but not with his left. This “right-hand-dominant lumberjack” would nevertheless work at
his best if allowed to use both hands to guide his axe, and his left hand would
normally be neither idle nor useless.
Since Broca’s
time, a wealth of additional clinical and brain imaging evidence has much
refined our knowledge of the functional roles of the two brain hemispheres and
their various areas in both the production and the comprehension of speech.
Much of that work was nicely summarized in a review by Dana Boatman (2004),
from which we present some of the highlights.
Much of the research into human
speech areas has been driven forward by clinical necessity, but perhaps
surprisingly not as much from the need to understand and diagnose speech
processing deficits as from the desire to cure otherwise intractable epilepsy
by surgical means. In these epilepsy surgeries, neurosurgeons must try to
identify the “epileptic focus,” a hopefully small piece of diseased brain
tissue that causes seizures by triggering waves of uncontrollable hyperexcitation which spread through much of the patient’s
brain. Successful identification and removal of the epileptic focus can cure
the patient of a debilitating disease, but there are risks. For example, if the
operation were to remove or damage one of the brain’s crucial speech modules, the
patient would be left unable to speak or to understand speech. That would be a crippling side effect of the operation, which one would like to avoid at all costs. So,
the more we know about the location of such crucial brain regions, the better the
surgeon’s chances are to keep the scalpel well away from them.
One complicating factor which neurosurgeons
have appreciated for a long time is that the layout of cortex is not absolutely
identical from one person to the next, and it is therefore desirable to test each
individual patient. One such test that has been administered frequently since
the 1960s is the so-called Wada procedure (Wada & Rasmussen, 1960), during
which a short-acting anesthetic (usually sodium amytal)
is injected into the carotid artery, one of the main blood supply routes for
the cerebral hemispheres of the brain. After injecting the anesthetic on either
the left or right side only, one can then try to have a conversation with a
patient who is literally half asleep, because one of his brain hemispheres is
wide awake, while the other is deeply anesthetized. Records of such Wada tests
have revealed that approximately 90% of all right-handed patients and about 75%
of all left-handed patients display Broca’s classic
“left hemisphere dominance” for speech. The remaining patients are either
“mixed dominant” (i.e., they need both hemispheres to process speech) or have a
“bilateral speech representation” (i.e., either hemisphere can support speech
without necessarily requiring the other). Right hemisphere dominance is
comparatively rare, and seen in no more than 1 to 2% of the population.
The Wada procedure has its
usefulness—for example, if we needed to perform surgery on the right brain
hemisphere, it would be reassuring to know that the patient can speak and
comprehend speech with the spared left hemisphere alone. However, often one
would like more detailed information about the precise localization of certain
functions than the Wada test can provide. To obtain more detailed information,
neurosurgeons sometimes carry out electrocortical
mapping studies on their patients. Such mappings require preparing a large part
of one of the patient’s brain hemispheres for focal electrical stimulation
either by removing a large section of the skull to make the brain accessible
for handheld electrodes, or by implanting a large electrode array over one of
the hemispheres. During the actual mapping, the patient receives only local
anesthetic and analgesics, and is therefore awake and can engage in
conversation or follow simple instructions.
The patients are then tested on
simple speech tasks of varying levels of complexity. The simplest, so-called acoustic-phonetic
tasks, require only very simple auditory discriminations; for example, the
patient is asked whether two syllables presented in fairly quick succession are
the same or different. At the next level, so-called phonological tasks require a slightly deeper analysis of the
presented speech sounds. For example, the patient might be asked whether two
words rhyme, or whether they start with the same phoneme. Note that neither acoustic-phonetic nor phonological tasks require
that the tested speech sounds be understood. For example, I can easily repeat
the syllable “shmorf,” I can tell that it rhymes with
“torf,” and that “shmorf”
and “torf” do not start with the same phoneme. I can
do all this even though both “shmorf” and “torf” are completely meaningless to me. The ability to use
speech sounds for meaningful exchanges requires a further so-called lexical-semantic
level of analysis, which typically involves asking a patient to carry out
simple instructions (such as “please wiggle the ring finger on your left hand”)
or to answer questions of varying level of grammatical complexity.
While the patients are grappling
with these acoustic, phonological, or semantic tasks, the surgeon will sneakily
send small bursts of electric current to a particular spot on their brain. This
current is just large enough to disrupt the normal activity of the neurons in
the immediate vicinity of the stimulating electrodes, and the purpose of this
is to test whether this highly localized disruption makes any obvious
difference to the patient’s ability to perform the task.
In such electrocortical
mapping studies, one does observe a fair degree of variation from one patient
to another, as no two brains are exactly alike. But one can nevertheless observe
clear trends, and Dana Boatman (2004) has summarized which parts of cortex
appear to be essential for acoustic, phonological, or semantic tasks across a
large number of patients. The results of her analysis are shown in figure 4.8.
Figure 4.8
Sites
where acoustic (A), phonological (B), or lexical-semantic (C) deficits can be
induced by disruptive electrical stimulation. The light gray symbols show locations on perisylvian cortex that were tested by applying disruptive
electrical stimulation. The black symbols show sites where such stimulation
interfered with the patient’s ability to perform the respective task.
Reproduced from figures 1 through 3 of Boatman (2004), Cognition 92:47–65, with permission from Elsevier, copyright (2004).
The data in figure 4.8 suggest a hierarchical
arrangement. The more complex the task, the larger the number
of cortical sites that seem to make a critical contribution because disruptive
stimulation at these sites impairs performance. Acoustic-phonetic tasks
(figure 4.8A)
are not easily disrupted. Only at a single spot on the superior temporal sulcus (STS) could electrical stimulation reliably
interfere with phonetic processing in all patients. Phonological processing (figure 4.8B) requires a greater
degree of analysis of the speech sounds, and it seems to involve large parts of
STS, as well as some points on Broca’s area on the
frontal lobe, since focal stimulation of any of these areas impairs
performance. Lexical-semantic tasks (figure
4.8C) are yet more complex, and seem to involve yet
more cortical territory because they are even more vulnerable to disruption.
Focal stimulation not just of the superior temporal gyrus
(STG), STS, and Broca’s area, but also of Wernicke’s area in the parietal lobe can disrupt the performance
of this type of task.
In figure
4.8 we also notice that the sites where one can
disrupt processing on the next higher level of complexity always appear to
include the sites that were involved in the lower processing levels. That is
perhaps unsurprising. If some focal electrical stimulation perturbs my
perception of speech sounds to the point where I can no longer tell whether two
words spoken in sequence were the same or different, then it would be odd if I
could nevertheless tell whether those words rhymed, or what they meant.
The clinical data thus suggests a
cortical processing hierarchy, which begins with acoustic-phonetic processing
in or near primary auditory cortex, and engages ever-increasing amounts of
cortical territory as the brain subjects vocalizations
to phonological and semantic analysis. But the clinical data cannot provide
much detail on what exactly each particular cortical area contributes to the
process. For example, the fact that semantic processing of sounds can be
disrupted by electrical stimulation of parts of Wernicke’s
area does not mean that important steps toward this semantic processing may not
have already begun at much earlier levels in the cortical hierarchy. In fact,
some results from animal research might be interpreted as evidence for
“semantic preprocessing” from the earliest levels.
Since semantic
processing involves finding the “meaning” of a particular speech sound or
animal vocalization, one can try to investigate semantic processing by
comparing neural responses to “meaningful” sounds with responses to sounds that
are “meaningless” but otherwise very similar. One simple trick to make speech
sounds incomprehensible, and hence meaningless, is to play them backward. Time
reversing a sound does not change its overall frequency content. It will flip
its modulation spectrum along the temporal axis, but since speech modulation spectra are fairly symmetrical around a temporal modulation rate of zero (see figure 4.1C), this does not seem to matter much. Indeed, if you have ever heard time-reversed speech, you may know that it sounds distinctly speechlike, not unlike someone talking in a foreign language (You can find examples of such time-reversed speech on the book’s website). Of
course, one can also time reverse the vocalizations of other animals, and
indeed, in certain songbird species, brain areas have been identified in which
neurons respond vigorously to normal conspecific songs, but not to
time-reversed songs (Doupe &amp; Konishi,
1991). Typically, the songbird brain areas showing such sensitivity to time
reversal seem to play an important role in relating auditory input to motor
output, for example, when a bird learns to sing or monitors its own song.
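The basic property that makes time reversal such a convenient control is easy to verify computationally. The following short Python sketch is our own illustration, not code from any of the studies cited; the chirp-like test signal is just a synthetic stand-in for a recorded vocalization. It reverses a waveform and confirms that the long-term magnitude spectrum is unchanged, even though the reversed sound would be heard quite differently:

import numpy as np

fs = 16000                                   # assumed sample rate in Hz
t = np.arange(0, 1.0, 1.0 / fs)
signal = np.sin(2 * np.pi * (200.0 + 300.0 * t) * t) * np.exp(-3.0 * t)  # synthetic test sound

reversed_signal = signal[::-1]               # play the sound backward

# Long-term magnitude spectra of the original and reversed versions are
# identical (to numerical precision), even though the two sounds differ.
spec_fwd = np.abs(np.fft.rfft(signal))
spec_rev = np.abs(np.fft.rfft(reversed_signal))
print(np.allclose(spec_fwd, spec_rev))       # -> True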
Interestingly, Xiaoqin
Wang and colleagues (1995) have used the same trick in marmosets, a species of
New World monkey, and found that, already in primary auditory cortex, many
neurons respond much more vigorously to natural marmoset twitter calls than to
time-reversed copies of the same call. Could it be that marmoset A1 neurons
fire more vigorously to the natural calls because they are “meaningful,” while
the time-reversed ones are not? If the same natural and time-reversed marmoset
calls are presented to cats, one observes no preferential responses in their A1
for the natural marmoset calls (Wang & Kadia,
2001), perhaps because neither the natural nor the reversed marmoset calls are
particularly meaningful for cats.
However, the interpretation of
these intriguing data is problematic. One complicating factor, for example, is
the fact that the relationship between the number of spikes fired by some
neuron during some relatively long time interval and the amount of information
or meaning that can be extracted from the neuron’s firing pattern is not
straightforward. A more vigorous response does not necessarily convey
proportionally more information. This was clearly illustrated in a study by Schnupp and colleagues (2006), who used the same marmoset
calls as those used by Wang et al. (1995), but this time played them either to
naïve ferrets, or to ferrets who had been trained to recognize marmoset twitter
calls as an acoustic signal that helped them find drinking water. For the
trained ferrets, the marmoset calls had thus presumably become “meaningful,”
while for the naïve ferrets they were not. However, neither in the naïve nor
the trained ferrets did primary auditory cortex neurons respond more strongly
to the natural marmoset calls than to the time-reversed ones. Instead, these
neurons responded vigorously to both types of stimuli, but many of them
exhibited characteristic temporal firing patterns, which differed
systematically for different stimuli. These temporal discharge patterns were
highly informative about the stimuli, and could be used to distinguish
individual calls, or to tell normal from time-reversed ones. However, these
neural discharge patterns had to be “read out” at a temporal resolution of 20
ms or finer; otherwise this information was lost. Figure
4.9 illustrates this. Schnupp
and colleagues (2006) also showed that training ferrets to recognize these
marmoset vocalizations did not change the nature of this temporal pattern code,
but did make it more reliable and hence more informative.
Figure 4.9
(A, B) Spectrograms of
two marmoset “twitter calls.” (C, D) Dot rasters showing responses of a neuron in ferret
primary auditory cortex to these sounds. Each dot represents one nerve impulse,
each row of dots an impulse train fired in response to a single presentation of
the corresponding stimulus. The neuron fires similar mean spike counts but with
different temporal discharge patterns in response to each stimulus. (E, F)
Responses shown in C and D are plotted as histograms, showing the mean firing
rate poststimulus onset, with the responses to
stimulus 1 shown in gray, those to stimulus 2 in black. At fine temporal
resolutions (small histogram bin width, e.g., 20 ms shown in E) the differences
in the response patterns are very clear and, as shown by Schnupp
et al. (2006), contain much information about stimulus identity. However, at
coarser temporal resolutions (300 ms bin width, shown in F), the responses look
very similar, and information about stimulus identity is lost.
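To make the idea of “reading out” spike patterns at different temporal resolutions concrete, the following Python sketch implements a toy version of this kind of analysis. It is our own illustration, not the classifier code used by Schnupp and colleagues (2006): two fictitious stimuli evoke the same number of spikes but with different timing, and a simple nearest-template classifier decodes single trials almost perfectly with 20-ms bins, yet only at chance level with a single 300-ms bin:

import numpy as np

rng = np.random.default_rng(0)
DURATION = 0.3          # length of the response window in seconds (an assumption)

def bin_spikes(spike_times, bin_width):
    # Bin a single spike train (spike times in seconds) into spike counts.
    edges = np.arange(0.0, DURATION + bin_width, bin_width)
    counts, _ = np.histogram(spike_times, bins=edges)
    return counts

# Toy data: both "stimuli" evoke the same number of spikes, but stimulus A
# concentrates them early and stimulus B late in the response window.
def fake_trial(center):
    return np.clip(rng.normal(center, 0.02, size=10), 0.0, DURATION)

trials = {"A": [fake_trial(0.08) for _ in range(20)],
          "B": [fake_trial(0.20) for _ in range(20)]}

def decode_accuracy(bin_width):
    # Leave-one-out, nearest-template classification at the given resolution.
    correct, total = 0, 0
    for stim, trains in trials.items():
        for i, trial in enumerate(trains):
            templates = {
                s: np.mean([bin_spikes(t, bin_width)
                            for j, t in enumerate(ts) if (s, j) != (stim, i)], axis=0)
                for s, ts in trials.items()
            }
            v = bin_spikes(trial, bin_width)
            guess = min(templates, key=lambda s: np.linalg.norm(v - templates[s]))
            correct += int(guess == stim)
            total += 1
    return correct / total

print("20 ms bins :", decode_accuracy(0.02))   # temporal pattern available: near-perfect
print("300 ms bins:", decode_accuracy(0.30))   # spike counts only: near chance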
These results indicate that the
representation of complex stimuli like vocalization or speech at early stages
of the cortical processing hierarchy is still very much organized around
“acoustic features” of the stimulus, and while this feature-based
representation does not directly mirror the temporal fine structure of the
sound with submillisecond precision, it does
nevertheless reflect the time course of the stimulus at coarser time
resolutions of approximately 10 to 20 ms. It may or may not be a coincidence
that the average phoneme rate in human speech is also approximately one every
20 ms, and that, if speech is cut into 20-ms-wide strips and each strip is time-reversed while the order of the strips is maintained, speech remains completely comprehensible (Saberi &amp; Perrott, 1999). (A sound example demonstrating this can be found on the book’s website.)
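This manipulation is straightforward to sketch in code. The function below is our own illustrative implementation, not a detail taken from Saberi and Perrott’s study; the strip width and the stand-in signal are assumptions. The waveform is cut into short strips, each strip is reversed in time, and the strips are concatenated again in their original order:

import numpy as np

def locally_time_reverse(signal, fs, strip_ms=20.0):
    # Cut the waveform into strips of strip_ms milliseconds, reverse each
    # strip in time, and put the strips back in their original order.
    strip_len = max(1, int(round(fs * strip_ms / 1000.0)))
    out = np.empty_like(signal)
    for start in range(0, len(signal), strip_len):
        segment = signal[start:start + strip_len]
        out[start:start + len(segment)] = segment[::-1]
    return out

# Placeholder usage: in practice, `speech` would be a recorded utterance
# sampled at rate fs; here one second of noise stands in for it.
fs = 16000
speech = np.random.randn(fs)
scrambled = locally_time_reverse(speech, fs, strip_ms=20.0)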
Further evidence for such a
feature-based representation of vocalizations and speech sounds in mammalian A1
comes from a recent study by Engineer and colleagues (2008), who trained rats
to recognize consonants of American English. The rats were trained to
distinguish nonsense syllables that differed only in their onset consonant:
“pad” from “bad,” “zad” from “shad,” “mad” from “nad,” and so on. Some of these distinctions the rats
learned very easily, while they found others more difficult. Engineer and
colleagues then proceeded to record responses to these same syllables from
hundreds of neurons in the auditory cortex of these animals. These responses
are reproduced here in figure 4.10 as neurogram-style dot raster displays. Each panel shows the
responses of a large number of primary auditory cortex neurons, arranged by
each neuron’s best frequency along the y-axis. The x-axis shows time after stimulus
onset. The panels zoom in on the first 40 ms only to show the response to the
onset consonant.
Figure 4.10
Responses
of rat A1 neurons to 20 different consonants.
Adapted by permission from Macmillan Publishers Ltd: figures 1 and 2 of Engineer et al. (2008), Nat Neurosci 11:603–608, copyright (2008).
Figure 4.10 shows that the A1 neurons
normally respond to the syllable onset with one, or occasionally two, bursts
of activity, but the bursts do not all start at the same time, nor are they all
equally strong. Instead, they vary systematically, depending on the sound
stimulus and the neuron’s frequency tuning. In fact, the firing pattern is
still very much like the neurogram responses we saw
in figure 2.13 for auditory nerve fiber
responses. When presented with an /m/ or an /n/, which
contain little acoustic energy at high frequencies, the high-frequency A1 neurons
fail to fire. Conversely, in response to an /s/, which
contains little energy at low frequencies, only the high-frequency neurons
respond. This interplay between frequency sensitivity and the acoustic
properties of the stimulus leads to each consonant having its own response pattern across the population of cortical neurons. Interestingly, when Engineer and
colleagues (2008) used pattern classifier algorithms similar to those used by Schnupp et al. (2006) to quantify the differences between
the cortical activity patterns evoked by the different consonants, they noticed
that these differences predicted how easily a rat would learn to distinguish
the sounds. Thus, /m/ and /n/ evoked rather similar response patterns, and rats
found it very hard to distinguish them, but /p/ and /b/ evoked rather different
response patterns, and rats learned to distinguish the sounds easily.
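The logic of this comparison can be sketched very simply: represent each consonant by its population response matrix (its “neurogram,” with neurons ordered by best frequency along one axis and time bins along the other), and take the distance between two such matrices as a predictor of behavioral discriminability. The Python snippet below is a toy illustration with made-up response matrices, not the analysis code of Engineer et al. (2008):

import numpy as np

def neurogram_distance(neurogram_a, neurogram_b):
    # Euclidean distance between two neuron-by-time firing-rate matrices.
    return np.linalg.norm(neurogram_a - neurogram_b)

# Made-up neurograms: 50 neurons x 40 one-millisecond time bins.
rng = np.random.default_rng(1)
resp_m = rng.poisson(2.0, size=(50, 40)).astype(float)
resp_n = resp_m + rng.normal(0.0, 0.3, size=resp_m.shape)   # /n/: very similar to /m/
resp_p = rng.poisson(2.0, size=(50, 40)).astype(float)      # /p/: an unrelated pattern

# Small distance -> sounds predicted to be hard to tell apart behaviorally;
# large distance -> predicted to be easy.
print("d(/m/, /n/) =", neurogram_distance(resp_m, resp_n))
print("d(/m/, /p/) =", neurogram_distance(resp_m, resp_p))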
The responses to /p/ and /b/ shown
in figure 4.10
are, in fact, particularly interesting, because they exhibit a phenomenon that had
previously been described by Steinschneider, Fishman,
and Arezzo (2003) in the primary auditory cortex of rhesus monkeys, and by Eggermont (1995) in the cortex of cats. In response to /p/, the
low- to mid-frequency neurons produce two bursts of impulses, while in response
to /b/ they produce just one. In fact, the second burst of action potentials in
the response to /p/ is not strictly a response to /p/, but a response to the
onset of voicing, to the /a/ in the syllables “pad” and “bad” that were
presented to the animals. There is, of course, an /a/ in
“bad” as well, yet the response to it is suppressed in the low- and mid-frequency
neurons, probably due to a phenomenon of “forward masking.” You may recall from
section 4.2 that the key distinguishing feature between /p/ and /b/ is that the
former has a longer VOT; that is, in “pad” the gap between the consonant and
the vowel may be some 60 to 70 ms long, while in “bad” it may be no longer than
20 ms. The longer gap in “pad” gives the neurons time to recover from forward
masking and to respond vigorously to both the /p/ and the /a/, whereas in
“bad,” forward masking much reduces the response to the /a/. Neural response
patterns tend to transition fairly abruptly from single-peaked to double-peaked
responses when a “ba” sound is morphed to a “pa” by
lengthening the VOT. Forward masking is thus one aspect of neural processing that
leads to a deviation in the neural firing patterns from what might be expected
on the basis of a purely spectrographic representation, and it may be responsible for the categorical perceptual boundaries associated with VOTs that we discussed in section 4.2. Thus, although
responses at the earliest cortical processing levels appear to represent purely
acoustic-phonetic aspects of vocalizations, their response properties may
nevertheless account for at least some aspects of categorical perception.
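A deliberately crude toy model illustrates how forward masking might sharpen a graded VOT continuum into a categorical-looking single-peak versus double-peak transition. The sketch below is ours, not a model from the studies cited; the recovery time constant and the “second peak” criterion are arbitrary assumptions chosen only to place the transition near the /b/–/p/ boundary described in the text:

import numpy as np

TAU_MS = 30.0   # assumed recovery time constant for forward masking (arbitrary)

def vowel_response(vot_ms, tau_ms=TAU_MS):
    # Fraction of the full response to voicing onset that survives forward
    # masking, recovering exponentially with time since the consonant burst.
    return 1.0 - np.exp(-vot_ms / tau_ms)

for vot in (10, 20, 40, 60, 70):
    r = vowel_response(vot)
    peaks = 2 if r > 0.5 else 1   # crude criterion for a visible second response peak
    print(f"VOT {vot:3d} ms -> relative vowel response {r:.2f} -> {peaks} peak(s)")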
As we have just seen,
the representation of animal vocalizations or speech sounds in early stages of
auditory cortex still appears to be fairly “raw,” and rather directly related
to physical stimulus attributes. A somewhat categorical distinction between /p/
and /b/ based on a relative suppression of a response to the voice onset seems
to be as good as it gets. One might reasonably assume that neural responses would
become fairly specific if they reflected the result of lexical-semantic
processing, yet most studies indicate that neurons in early cortical processing
stages are not very selective but respond more or less vigorously to all manner
of vocalizations as well as to other sounds from inanimate objects. It
therefore looks as if many of the most interesting aspects of speech and vocalization processing occur “beyond” the primary auditory fields.
Unfortunately, when it comes to
the study of speech and vocalization processing in higher-order cortical areas,
obtaining the very detailed data against which to test particular theories is very
difficult. One of the main experimental approaches for this type of work is noninvasive
functional imaging in normal human volunteers, using techniques like positron
emission tomography (PET) or functional magnetic resonance imaging (fMRI). These approaches can yield some interesting results.
For example, a study by Scott and colleagues (2000) provided intriguing
evidence that parts of the left anterior STG may be activated selectively by
intelligible speech. This conclusion was based on a comparison of cortical
activation patterns obtained either with normal speech sounds or with speech rendered
unintelligible by inverting sound along the frequency axis (You can find an
example of such spectrally rotated speech on the book’s website).
However, as Scott herself points
out in a comprehensive review (Scott & Wise, 2004), word deafness (i.e.,
the inability to recognize the meaning of words) only rarely results from
damage to the left STG alone, and usually occurs only in patients who suffered
injury to the STG on both sides. This highlights one methodological problem
inherent in functional brain imaging studies. When a particular area of the
brain appears to “light up” under the scanner, what we really see is a
marginally greater blood supply to this area during some particular stimulus
regime than during another. As a measure of brain activity, this is very
indirect. One problem is that it reveals only the tip of the iceberg; it shows
which brain area blushed significantly more under the effort of the neural
processing it carried out than some neighboring area. And here “significantly more” is to be understood
in the statistical sense, meaning that the difference can be measured with a
fair degree of confidence, not that it is very large. Neighboring areas may
have made crucial contributions to the processing, but these contributions fail to show up in the functional scan because they were made “relatively effortlessly.”
Another limitation of fMRI and PET stems from their inherently poor temporal
resolution, as they effectively measure responses of the brain’s vasculature that
reflect the relatively slowly changing metabolic demands of the neural tissue.
Consequently, fMRI and PET cannot resolve any brain processes
that occur on timescales faster than a few seconds. As we saw in the previous
section, deciphering the cortical code is likely to require a temporal
resolution approximately 1,000-fold faster. As a result, a number of elegant
theories that have recently emerged remain largely untestable
with functional imaging techniques. For example, it has been suggested that
certain brain areas may be specialized for processing slow aspects of speech,
such as “prosody”—that is, the overall melody and rhythm of speech, which
conveys emotional undertones or labels sentences as questions or statements—while
other brain areas may specialize in processing fast features, such as formant
transitions that identify individual speech sounds (Poeppel
& Hickok, 2004). Whether, or to what extent, this is really true we will know
only when techniques that provide more direct observations of neural activity
on a millisecond timescale become widely available.
Detailed and direct observations
are, of course, possible in animal experiments, where microelectrodes can be
implanted directly in the areas of interest. However, the layout of high-order
cortical fields may not be identical from one species of mammal to the next,
and humans not only have a uniquely rich, complex vocal communication system,
they also have substantially larger cortices than almost any other mammal. Studies
carried out on some of our primate cousins, such as rhesus macaques, may
nevertheless provide interesting insights that are likely to be representative
of the processes we would expect to take place in human brains.
Based on anatomical observations,
it has been suggested that auditory cortical areas in primate brain may be
organized into two more or less discrete processing streams (Romanski et al., 1999): a dorsal stream, which may be
concerned mostly with identifying sound source locations, and a ventral stream,
which is thought to play the lead role in identifying sounds. Anatomical
evidence from tracer studies indicates that this ventral stream should run from
primary auditory cortex via medial belt (secondary) areas to anterior STG and
inferotemporal cortex, and from there finally to areas
in the ventral prefrontal cortex (vPFC). Recognizing
and distinguishing different types of vocalizations or spoken words should most
certainly be a “ventral stream task,” and the notion that the ventral stream
may form the “sound-meaning interface” in human speech processing is often
discussed in the literature (Hickok & Poeppel,
2004; Scott & Wise, 2004).
Figure 4.11
Putative
dorsal and ventral processing streams in macaque auditory cortex, as suggested
from anatomical tracer studies.
Adapted by permission from the author and from Macmillan Publishers Ltd: figure 3D of Romanski et al. (1999), Nat Neurosci 2:1131–1136, copyright (1999).
Against this background, a recent
set of experiments by Russ and colleagues (2008) is therefore of particular
interest, as these investigators were able to record activity from individual
neurons in the STG and the vPFC of awake rhesus
macaques, who were listening to ten very different types of rhesus vocalization
calls. These calls are acoustically very distinct, the animals make different
calls in different social situations, and there is little doubt that each of
these calls therefore has a different meaning for the animals. If neurons in
the ventral stream indeed represent the meaning of a vocalization, then one
might expect these neurons to be rather selective in their response to these
calls; that is, each neuron might respond to only a small subset of calls with
similar meanings. We might also expect that this specificity would increase as
one ascends along the ventral path from STG to vPFC.
Finally, we would not expect neurons in the vPFC to
be very interested in minute acoustic details, such as the temporal fine
structure of the sound, nor to represent much information in the temporal fine
structure of their discharges. After all, if I pronounce a particular word, the
meaning of my utterance does not depend on whether I speak fast or slowly or
introduce small gaps between the syllables, and even speaking very rapidly I
would find it difficult to convey more than two or three concepts per second on
average. Meaning therefore unfolds relatively slowly, and the temporal fine
structure of sounds becomes irrelevant once their meaning has been identified.
Consequently, if neurons in vPFC, at the end of the
ventral stream, really represent the abstracted, lexical meaning of a
vocalization rather than the sound itself, we would not expect their temporal
firing patterns to convey much stimulus-related information on a millisecond
timescale.
What makes the experiments by Russ
and colleagues (2008) so interesting and surprising is that they produced a
wealth of data that clearly runs counter to all these expectations. Neurons in
the STG and vPFC are not very selective in their
responses. The large majority of neurons respond vigorously (with >50% of
their maximal firing rate) to more than half of the vocalizations tested. Nor
do responses become more specific as one ascends from STG to vPFC—if anything, the reverse is true. But both STG and vPFC neurons convey a great deal of information about which
of the vocalizations was presented in the temporal fine structure of their
discharges. Using spike pattern classification techniques identical to those
used by Schnupp and colleagues (2006) to analyze
neural discharge patterns recorded in ferret A1, Russ et al. (2008) were able
to show that responses of neurons in macaque STG and vPFC
also need to be decoded at a resolution of a few milliseconds if the individual
vocalizations are to be correctly identified. Furthermore, the reliance on
precise temporal patterning of the discharges is, if anything, larger in vPFC than in STG.
Do the results of Russ and
colleagues (2008) mean that our intuitions about how our brain “ought to”
represent the meaning of sounds are simply wrong, and that the meaning of a
sound is never represented explicitly through invariant, sparse, and
categorical responses? Perhaps, but alternatively it could be that, to see such
meaning-specific responses, one needs to look outside the auditory pathway.
After all, meaning is abstracted somewhat beyond the level of any particular
sensory modality, and it is not uncommon that the same meaning can be conveyed with
both sounds and pictures. Interestingly, recent work by Quian
Quiroga and colleagues (2009) found neurons in
structures buried inside the temporal lobe, such as the hippocampus, the amygdala, and the entorhinal
cortex, that may respond to pictures of some specific familiar object, say a
landmark or a person or a pet, and these same neurons may also respond to that object’s
name, either spoken or written. These object-specific neurons are highly
selective for stimulus category, responding typically to only one or two
stimulus objects out of over a hundred tested. At the same time, they are
unselective in the sensory modality, as they frequently respond as vigorously
to a spoken name or a characteristic sound as to a visual image. They have long
response latencies (300 ms or so for images, 500 ms or more for sounds), and
their discharges appear not to reflect acoustic features of the auditory
waveform in any way.
It is curious that such “semantic”
responses to the meaning of sound have been observed only in structures such as
the amygdala (which is thought to process the
emotional significance of stimuli, e.g., “are they scary or not”) or the
hippocampus (which seems to serve as the gateway to long-term episodic memory).
As we have seen, even at the highest levels of the auditory “what stream,”
neural responses appear overwhelmingly tuned to acoustic stimulus properties,
not their semantics. Perhaps we simply haven’t yet looked hard enough for semantically
tuned responses in higher-order auditory cortex. It is worth bearing in mind,
however, that such semantic responses may also be rare in the hippocampus, the amygdala, and entorhinal cortex. Quian Quiroga and colleagues
(2009) tested 750 neurons, and found that fewer than 4% of neurons (25 in total) seemed to be object specific and responsive to sound.
If semantically tuned neurons formed a small subset of the neural population in
higher-order auditory cortex, and if their responses were very highly selective
and “sparse,” then they could have slipped through the net in previous
investigations.
The brain does, of
course, rely mostly on acoustic information to process speech and
vocalizations, but it will also happily incorporate visual information if this
is useful. Listeners who suffer from hearing impairments or who have to operate
under difficult conditions with large amounts of background noise often find it
much easier to understand a speaker if they can also observe the movement of
his or her mouth, and “lip read.” At a very basic level, lip reading can be
helpful simply because of the temporal cueing it provides: Sounds you hear when
the speaker’s mouth is not moving are bound to be purely background noise. But
since the lips (together with the tongue and the soft palate) are among the chief articulators used to shape speech sounds, visual observation of the lips
provides information that can help distinguish different phonemes and influence
their perception.
This is vividly illustrated by a
visual-auditory illusion known as the McGurk effect (McGurk & MacDonald, 1976). To create the McGurk effect, a video is made showing a person
articulating the syllables “gaga” over and over again. The video is then
synchronized with a soundtrack of the person speaking the syllables “baba.” If you watch a McGurk
video, your ears will hear the syllables “baba,” but
you can also see that the lips are not closed at the onset of the syllables, so
your eyes tell you that the syllables you heard could not have started with a
labial plosive. You will therefore not perceive the /ba/
that was actually delivered to your ears, but instead hear a /da/ or a /tha/, as these are
acoustically similar to the actual sound, but are articulated by the tip of the
tongue, which is not visible, so the eyes do not provide evidence against them.
The /da/ or /tha/ you
perceive is, in effect, the most plausible compromise between the /ga/ that is shown and the /ba/
that is played. You can find a McGurk effect video on
the book’s website. Try watching it, and then just listening to
it with your eyes closed. The difference in the sound you hear depending on
whether your eyes are open or not is quite compelling.
With your eyes closed you will clearly hear that the movie’s sound track
consists entirely of the syllables “baba,” but when
you open your eyes the sound appears to change instantly to “dada” or “thatha.”
The McGurk
effect nicely illustrates how visual information can directly and powerfully
influence and enrich our auditory perception of speech sounds, and it probably
exerts this influence through visual inputs that feed directly into the
auditory cortex. A number of electrophysiological studies have reported
responses in auditory cortex to visual stimuli (Bizley
et al., 2007; Brosch, Selezneva,
& Scheich, 2005). Also, imaging experiments have
shown that auditory cortex can be activated by silent lip reading (Calvert et
al., 1997), and activity in auditory cortex can be enhanced when speech is
presented along with a movie showing the face of a speaker (Callan
et al., 2003). Selective enhancement of responses to
vocalization stimuli that are seen as well as heard has also been described in
monkeys (Ghazanfar et al., 2005).
Thus, visual information can contribute
significantly to the neural processing of vocalization stimuli, but it is
important to remember that the role of the visual modality is nevertheless a
minor one. Telecommunications technology has advanced to the point where video
telephony is becoming widely available, yet most of us do not feel the need for
it. Nobody would think it a practical idea to rely on the video only and switch
the sound off. Educational policies that discourage profoundly deaf children
from learning sign language and instead try to teach them to understand normal
speech through lip reading alone are well intended, but nevertheless badly
flawed. They ignore the important fact that most of the articulatory
gestures we use to encode our thoughts in speech, such as voicing and all the
subtle movements of the tongue and the soft palate, simply cannot be observed
by looking at a speaker’s face. They are accessible to us only through their
acoustic fingerprints, which a healthy auditory system can decipher with
surprising ease.
As we have seen,
speech most likely evolved from initially rather simple vocal communication
systems, comprising perhaps no more than a dozen or so distinct messages, such as
mating calls, alarm calls, pup calls, threats, and a few others. From these
humble beginnings, speech evolved into a staggeringly sophisticated
communication system, in which humans can combine and recombine a relatively
modest number of speech sounds to communicate a seemingly limitless variety of
ideas. These ideas reach the ear of the listener encoded as a more or less
continuous stream of amplitude- and frequency-modulated sound. But not all
spectral and temporal modulations in the speech signal are equally important.
Relatively coarse levels of detail (temporal modulations between 1 and 7 Hz and
spectral modulations of less than 4 cycles/kHz) are
usually sufficient for a successful decoding of the message.
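One way to appreciate what “relatively coarse levels of detail” means in practice is to smooth the temporal envelope of each frequency channel so that only slow modulations survive. The sketch below is our own illustration, assuming SciPy is available; the channel envelopes are random placeholders rather than a real spectrogram. It low-pass filters per-channel envelopes at roughly 7 Hz, the kind of manipulation used to test whether such impoverished signals still support speech recognition:

import numpy as np
from scipy.signal import butter, filtfilt

def lowpass_envelopes(envelopes, frame_rate_hz, cutoff_hz=7.0):
    # Low-pass filter the temporal envelope of each spectral channel so that
    # only modulations slower than cutoff_hz survive.
    b, a = butter(4, cutoff_hz / (frame_rate_hz / 2.0), btype="low")
    return np.array([filtfilt(b, a, channel) for channel in envelopes])

# Placeholder envelopes: 32 spectral channels, sampled at 100 frames per second.
envelopes = np.abs(np.random.randn(32, 500))
smoothed = lowpass_envelopes(envelopes, frame_rate_hz=100.0)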
The auditory system is thought to
decipher speech sounds through a hierarchy of successive analyses, which operate
on different timescales. Acoustic-phonetic analysis examines amplitude and
frequency modulations in the incoming sound in order to detect and characterize
speech sounds within the signal, phonological processing aims to reconstruct
how speech sounds are arranged to form syllables and words, while lexical-semantic
analysis aims to decipher the meaning of the sounds. Most of these processing
steps are thought to involve areas of cortex, chiefly those on the upper part of the temporal lobe but also some frontal and parietal areas, predominantly in the left cerebral hemisphere. Many important details of how these cortical areas operate, however, remain obscure.