On Musical Selection in Research: 4. With The Top Down

Did you hear the one about how harmony is the key to music? That really struck a chord.

If one had to sum up existence in a single word, oscillation would do nicely. Feynman often used the word jiggling, because it’s funnier.

“The world is a dynamic mess of jiggling things if you look at it right. And if you magnify it, you can hardly see anything anymore, because everything is jiggling and they’re all in patterns and they’re all lots of little balls. It’s lucky that we have such a large scale view of everything, and we can see them as things – without having to worry about all these little atoms all the time.”

Richard Feynman, Fun to Imagine, “Rubber Bands,” 1983

You and me and everything we know are merely collections of oscillating particles and waves in various mediums of varying mystery. Our sensory systems detect these jigglings in myriad ways and sensitivities. Comparing, say, olfaction to proprioception would be rather nonsensical, but I think we can make a good case for comparing hearing to everyone’s favorite perceptual process, vision.

Light information is complex, and quite a lot of the brain dedicates itself to perceiving and decoding this data stream, eventually cobbling together a rough approximation of our visual surroundings. While visual information is remarkably important for typical humans, the complexity of this perceptual process means we’re not quite as time-aware of visual information as we’d like to believe. To illustrate this, we’re going to look at visual recreations (movies) vs. sound recreations to see the difference in resolution that tricks the eye into perceiving movement and what it takes to trick the ear into the same thing.

Frame rate, or frames per second (FPS), is a measure of how many still images must be displayed per second to give the impression of a smoothly flowing moving scene. At about 10-12 FPS and slower, humans register each still as an individual image [1]. Higher frame rates start appearing as seamless motion, though at first the motion appears quite choppy. A standard US film has a frame rate of 24 FPS to avoid the choppy/sped-up effect of older films. The faster the frame rate, the smoother the illusion appears, as depicted here:

A great, simple demonstration of four different frame rates.

Perceptual sound information is objectively less complex, and this means we process this data stream more directly through the auditory system. This allows for a more rapid translation of external oscillations into auditory temporal neural signals. People who edit both film and audio actually deal with this in a practical way all the time. Visually, snapping edits to a grid of 24 frames per second is perfectly fine, but each of those frames lasts a little over 40 ms, and an offset that size is enough to sound out of sync to most listeners. To try this yourself, just get a drummer to play a drum machine with a 40 ms latency on the drum sounds. This demonstrates a concept which I have taken to calling:

Cognitive Temporal Resolution

Compare the frame rate video above to the below demonstration of three different audio sampling rates.

Examples of 44.1 kHz, 22.05 kHz, and 11.025 kHz digital audio sampling rates.

Because their oscillations are quicker, and thus closer together, higher frequency instruments like cymbals are a dead giveaway when it comes to sound quality. Sample rate is the number of samples taken per second in a digital audio recording. It looks like this:

Visualization of sampling rates. X-axis is time.
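To put rough numbers on the three sampling rates above next to the 24 FPS film rate mentioned earlier, here is a minimal sketch. The function name `period_ms` is just an illustrative helper, not anything from an audio library.

```python
# How much time one "unit" of information covers: a film frame vs. an audio sample.
FILM_FPS = 24
AUDIO_RATES_HZ = [44_100, 22_050, 11_025]


def period_ms(rate_per_second: float) -> float:
    """Duration of a single frame or sample, in milliseconds."""
    return 1000.0 / rate_per_second


frame_ms = period_ms(FILM_FPS)
print(f"One film frame at {FILM_FPS} FPS lasts {frame_ms:.2f} ms")

for rate in AUDIO_RATES_HZ:
    sample_ms = period_ms(rate)
    # Number of audio samples that go by while one film frame is on screen.
    samples_per_frame = frame_ms / sample_ms
    print(f"{rate} Hz: one sample every {sample_ms:.4f} ms, "
          f"{samples_per_frame:.1f} samples per film frame")
```

A single 24 FPS frame spans about 41.7 ms, and in that time a 44.1 kHz recording takes 1,837.5 samples.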

If you’re interested, you can learn more about sampling rate, bitrate, and bit depth here:

You’ll note that we discuss these audio resolutions in the tens of thousands, as opposed to the couple dozen or so required in temporally arranged visual information. Heck, humans can drum roll faster than film frames fly by, and it still sounds like individual drum hits, rather than tricking our brains like a fast visual frame rate does. Or for a bit of fun, take a listen to these accelerating snare hits, which begin at 16th notes on a lazy 60 BPM and steadily accelerate.

808 snare 16th note accelerando from 60 BPM to 999 BPM

First, a rising bass note creeps in, then, toward the end of the clip, the individual snare hits grow indistinguishable. At this point, we perceive them as an audible pitch.
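If you'd like to see where the hits should start blurring together, the arithmetic is simple. This sketch assumes four 16th notes per beat and, as a loose rule of thumb rather than a hard perceptual boundary, treats 20 Hz (roughly the bottom of the hearing range) as the point where repetition starts to read as pitch. The function names are made up for illustration.

```python
LOWER_PITCH_LIMIT_HZ = 20.0  # assumed rough threshold, not a sharp perceptual line


def hits_per_second(bpm: float, notes_per_beat: int = 4) -> float:
    """Repetition rate of evenly spaced notes (default: 16th notes)."""
    return bpm / 60.0 * notes_per_beat


def bpm_for_rate(rate_hz: float, notes_per_beat: int = 4) -> float:
    """Tempo at which the notes repeat `rate_hz` times per second."""
    return rate_hz * 60.0 / notes_per_beat


for bpm in (60, 120, 300, 600, 999):
    rate = hits_per_second(bpm)
    label = "pitch-like" if rate >= LOWER_PITCH_LIMIT_HZ else "separate hits"
    print(f"{bpm:>3} BPM -> {rate:5.1f} hits/s ({label})")

print(f"16th notes reach 20 Hz at {bpm_for_rate(LOWER_PITCH_LIMIT_HZ):.0f} BPM")
```

At 60 BPM the snare lands 4 times per second. By 999 BPM it lands about 66.6 times per second, which is well into bass-note territory.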


Frequency is a physical attribute, a measure of the number of oscillations within a medium over time. Pitch is the human perception and subsequent analysis of frequency. In English and most cultures, we think of this process as the ability to arrange stimuli along a spectrum of high and low tones.

The average range of a young person’s hearing is roughly 20 – 20,000 Hz. Just a fun fact: the standard sampling rate is 44,100 Hz because a sampling rate needs to be at least double whatever frequency it’s representing in order to capture each full wave cycle. So, the highest frequency a sampling rate of 44,100 samples per second can convey is 22,050 Hz, which, you’ll notice, is higher than the human range of hearing. You can read more about higher sampling rates and the perception of digital audio with virtual instruments for a bit more nuance on that story.
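The "double the frequency" rule can be sketched in a few lines. This is the textbook idealization (no anti-aliasing filter, pure sine tones), and the helper names are hypothetical. A tone above half the sampling rate doesn't vanish. It folds back down and shows up as a lower frequency.

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a given sampling rate can represent."""
    return sample_rate_hz / 2.0


def apparent_frequency_hz(tone_hz: float, sample_rate_hz: float) -> float:
    """Frequency a pure tone appears to have after ideal sampling.

    Tones at or below the Nyquist limit come back unchanged; tones above it
    fold back ("alias") to the distance from the nearest multiple of the
    sampling rate.
    """
    nearest_multiple = sample_rate_hz * round(tone_hz / sample_rate_hz)
    return abs(tone_hz - nearest_multiple)


for rate in (44_100, 22_050, 11_025):
    print(f"{rate} Hz sampling -> limit {nyquist_hz(rate):.1f} Hz; "
          f"an 8000 Hz tone comes out at {apparent_frequency_hz(8_000, rate):.0f} Hz")
```

At 11,025 Hz sampling, an 8,000 Hz cymbal partial would come back as a 3,025 Hz tone unless it is filtered out first, which is one reason the cymbals give the game away at low sampling rates.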

As you age, the top frequency of your hearing range lowers. You can test where you’re currently at here if you have even marginally decent speakers. Bass is actually pretty tough to test at home due to practical factors like speaker size and placement, standing waves, room tuning, etc., but testing higher frequencies is straightforward. I’m 36 years old, and my hearing cuts out somewhere between 16,000 and 16,500 Hz.

Below, you’ll find a chart that shows average spectral ranges for standard Eurotraditional musical instruments. Take things like this with a grain of salt, but it’s a nice visualization nonetheless.

Go here for higher resolution, or here for an alternate interactive version

This is a mixing guide, hence the handy breakdown of ranges along the bottom. Notice also that above around 4 or 5 kHz, humans lose the ability to extract pitch information from sound [1]. So, everything we’ll be talking about for the remainder of this section will deal mostly with sounds that occur below that shelf.

Pitch Perception

I’m going to draw a lot of the following info from the book Pitch: Neural Coding and Perception, which includes not only a wealth of useful information, but also the following list of unanswered questions:

  1. How is phase-locked neural activity transformed into a rate-place representation of pitch?
  2. Where does this transformation take place, and what types of neurons perform the analysis?
  3. Are there separate pitch mechanisms for resolved and unresolved harmonics?
  4. How do the pitch mechanisms interact with the grouping mechanisms so that the output of one influences the processing of the other and vice versa?
  5. How and where is the information about pitch used in object and pattern identification?

A basic flowchart of auditory signal flow.

Pitch: Neural Coding and Perception, Plack, Oxenham, Fay, 2005

Notice all the question marks? This is just another example of how little we truly know about how pitch gets perceived.

That said, here are some things we do know. Perceived loudness depends on frequency, a relationship that has been mapped using the equal-loudness contour chart. Low frequencies are processed differently than high frequencies, although the exact mechanism for how this works is still mysterious. This loudness contour is very much in line with our inherent auditory preference for vocals. Pitch perception involves top-down processing [1]. It’s probably both top-down and bottom-up to some extent. Top-down essentially means “cognitive,” while bottom-up means “perceptual,” although I’m sure plenty of scientists would argue with that semantic simplification. You might also consider them to mean “analytic” and “triggered” respectively.


You’ll recall from part one how amorphous and subjective musical harmony is as a theoretical concept. This may be why so much of the research on it is rather amorphous and unspecific, with studies often at odds with each other. See examples of this here, here, and here.

The concept of harmonic structure (especially in Eurotraditional theory) is based on the idea of an interactive tension-resolution cycle between consonance and dissonance. I’m going to go with a pretty book on this phenomenon, Neurobiological Foundations for the Theory of Harmony in Western Tonal Music. Much of the below info is paraphrased or quoted from that text.

Consonance generally means a bunch of tone harmonics line up. This means neurons that are used to firing/vibrating/resonating in sync, do so. The brain initially likes this, because it matches what it often hears in nature.

Dissonance (or roughness) means a bunch of neurons either can’t figure out if they should resonate together or are, somehow, cognitively bumping into each other, so to speak. The brain initially gets annoyed by this.

However, if you play too many sequential harmonies considered consonant by the listener, it often starts to sound boring. As we’ve already covered, brain hate boring. To avoid this static state, we arrange harmonies of varying dissonance/roughness/tension for a while before resolving that tension with a consonant harmony. The most relieving voicing of that chord will usually have the key’s fundamental pitch (the tonic note) in the bass part. This is (probably) because we are cognitively expecting the fundamental due to learned real-world neural clusters. Thus, hearing this resolution engages the neurochemical expectation/reward system. Chord progression (along with volume changes and the entrance of a human voice) has been strongly tied to the chills response or frisson often studied in musical neuroscience, because it’s easy for test subjects to check a yes/no box about whether they have experienced it.

A demonstration of hearing a sounded vs. missing fundamental.
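For the curious, here is a rough sketch of why a missing fundamental can still be "there" in the signal. It builds a tone from the 2nd, 3rd, and 4th harmonics of 200 Hz, confirms there is no energy at 200 Hz itself, and then uses autocorrelation to find how often the waveform repeats. Autocorrelation is only a convenient stand-in here, not a claim about how neurons actually do it, and all the constants are arbitrary choices for the demo.

```python
import math

SAMPLE_RATE_HZ = 8_000
FUNDAMENTAL_HZ = 200.0   # deliberately absent from the signal
HARMONICS = (2, 3, 4)    # only the 400, 600, and 800 Hz partials are sounded
WINDOW = 400             # 50 ms analysis window (10 periods of 200 Hz)
MIN_LAG = 8              # shortest period searched: 1 ms (1000 Hz)
MAX_LAG = 160            # longest period searched: 20 ms (50 Hz)

signal = [
    sum(math.sin(2 * math.pi * h * FUNDAMENTAL_HZ * n / SAMPLE_RATE_HZ)
        for h in HARMONICS)
    for n in range(WINDOW + MAX_LAG)
]


def amplitude_at(freq_hz: float) -> float:
    """Amplitude of a single frequency component over the analysis window."""
    re = sum(signal[n] * math.cos(2 * math.pi * freq_hz * n / SAMPLE_RATE_HZ)
             for n in range(WINDOW))
    im = sum(signal[n] * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE_HZ)
             for n in range(WINDOW))
    return 2 * math.hypot(re, im) / WINDOW


def autocorrelation(lag: int) -> float:
    """Similarity between the signal and a copy of itself delayed by `lag` samples."""
    return sum(signal[n] * signal[n + lag] for n in range(WINDOW))


def repetition_rate_hz() -> float:
    """Rate at which the whole waveform repeats, taken from the earliest strong peak."""
    scores = {lag: autocorrelation(lag) for lag in range(MIN_LAG, MAX_LAG + 1)}
    best = max(scores.values())
    first_peak = min(lag for lag, score in scores.items() if score >= 0.99 * best)
    return SAMPLE_RATE_HZ / first_peak


print(f"Energy at 200 Hz: {amplitude_at(200.0):.6f}")
print(f"Energy at 400 Hz: {amplitude_at(400.0):.6f}")
print(f"Waveform repeats at {repetition_rate_hz():.1f} Hz")
```

Even though nothing in the signal vibrates at 200 Hz, the combined waveform repeats 200 times per second, which matches the pitch listeners tend to report.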

Harmony can be thought of as having both a vertical and horizontal dimension. This is directly analogous to music arranged along a staff. If two or more pitches are played at a time, this is the vertical dimension. Doing this in succession over time is the horizontal dimension. The horizontal dimension involves psychological priming, meaning perception of one stimulus affects how those following it are perceived, and strongly favors top-down processing. The time window over which sound information is integrated in the vertical dimension spans about a tenth of a second to a few seconds, i.e. from sixteenth notes to tied whole notes at 120 beats per minute. Thus, many minimalist pieces fail to register as true harmonic progressions due to their chord changes falling outside of this perceptual window.
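To see what those note values come to in seconds, here is a small sketch. It assumes the quarter note carries the beat, which the text doesn't state explicitly, and `note_seconds` is just an illustrative helper.

```python
def note_seconds(bpm: float, beats: float) -> float:
    """Duration of a note lasting `beats` beats at the given tempo."""
    return 60.0 / bpm * beats


TEMPO_BPM = 120
NOTE_LENGTHS_IN_BEATS = {
    "sixteenth note": 0.25,
    "quarter note": 1.0,
    "whole note": 4.0,
    "two tied whole notes": 8.0,
}

for name, beats in NOTE_LENGTHS_IN_BEATS.items():
    print(f"{name:>22}: {note_seconds(TEMPO_BPM, beats):.3f} s")
```

Under that assumption, a sixteenth note at 120 BPM lasts 0.125 s and two tied whole notes last 4 s.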

If you’re interested, I really like the City University of Hong Kong’s online auditory neuroscience resource, which has an excellent list of articles regarding harmony and pitch as well as resources for other musical elements. This discussion handily leads us into one of the most interesting phenomena that arise from music perception.

The horizontal dimension deals with both successive chords/tone clusters (harmonic progressions) and individual tone lines (melodies) almost always in the topmost voicing, which leads us nicely into the next pitched musical element.


Melody is the combination of rhythm and pitch, a linear succession of tones perceived as a single entity. A melody as a single unit emerges following top-down processing. You can have a melodic bass line, but due to the aforementioned difference in processing low and high frequencies, a lot of this information is processed rhythmically instead of melodically. Extracting tonal info from purely sub-bass sounds is difficult for the average listener, as opposed to bass tones with lots of higher-frequency information such as would be achieved with bass effects pedals, for example.

The function of melody is closely tied to memory and expectation. You can imagine melodic expectation as a sort of cognitive temporal probability cloud constantly running during active listening. Melodic expectation is probably learned rather than innate and depends on cultural background and musical training. The process for expectancy of harmonic and melodic info is additive, i.e., we compute the rhythmic pitch line and harmonic info together to analyze what psychologists call – and I’m not making this up – a music scene.

Unfortunately, psychologists never specify which music scene.

The multitude of melodic perception/expectation models mostly describe different aspects of the same system. One, called melodic segmentation, is used in computational analyses and uses formulae to automatically separate melodies into small segments, just like the short repeating units in postminimalist music. The well-known researcher Jamshed Bharucha helped pioneer the fascinating concept of melodic anchoring, which describes tone stability and instability based on where the pitch falls in a harmonic context. Repetition helps us commit complete pleasing melodic lines to memory, which is the basis for the concept of the hook used in folk and pop music. This also likely cements how we expect later melodies to progress, creating a sort of taste-feedback-loop.
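To give a flavor of what automatic segmentation can look like, here is a deliberately simple toy. It is not any specific published model: it just starts a new segment whenever there is a long pause or a big pitch leap. The thresholds, the note format (onset time in seconds, MIDI pitch number), and the example melody are all assumptions made up for this sketch.

```python
from typing import List, Tuple

Note = Tuple[float, int]  # (onset time in seconds, MIDI pitch number)


def segment_melody(
    notes: List[Note],
    max_gap_s: float = 0.75,
    max_leap_semitones: int = 7,
) -> List[List[Note]]:
    """Split a melody wherever the pause or the pitch leap exceeds a threshold."""
    if not notes:
        return []
    segments = [[notes[0]]]
    for previous, current in zip(notes, notes[1:]):
        gap = current[0] - previous[0]
        leap = abs(current[1] - previous[1])
        if gap > max_gap_s or leap > max_leap_semitones:
            segments.append([current])      # boundary: start a new segment
        else:
            segments[-1].append(current)    # same gesture: keep grouping
    return segments


melody = [
    (0.0, 60), (0.5, 62), (1.0, 64),   # stepwise rise
    (2.5, 64), (3.0, 65), (3.5, 67),   # enters after a long pause
    (4.0, 55), (4.5, 57),              # enters after an octave drop
]

for index, segment in enumerate(segment_melody(melody), start=1):
    print(f"Segment {index}: pitches {[pitch for _, pitch in segment]}")
```

With these thresholds the example melody splits into three groups of 3, 3, and 2 notes. Real models weigh many more cues, but the basic idea of looking for local discontinuities is similar.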

Rhythmic pulses and harmony form the basis of music like Reich’s Electric Counterpoint mentioned in part three. Later in the same piece, melodic segmentation provides discrete repeating melodic note groups, approaching but not quite arriving at a traditional melody. Now, with the addition of true melodies, we are able to build the majority of music. All that’s left is determining the context and tradition of what we want our music to sound like. This involves choosing the remaining elements of the musical cake: timbre (instrumentation, production, loudness), overall structure/form, and in many cases lyrical content.


Genre is a loaded word and the source of much contention [1][2][3][4]. It is a purely social and semantic method of organizing sound. Like many categorization methods, it is subjective and rife with overlap, gray areas, and fusion practices which then give rise to new definitions that further muddy the waters. But it is directly related to the context or tradition of how music is received and so shall be addressed.

Genre terminology is best used as shorthand for music discussion. “Alternative rock with funk elements” invokes alternative, rock, and funk to quickly communicate what to expect when listening to the Red Hot Chili Peppers. One could then sonically associate similar bands like Primus, Living Colour, Lenny Kravitz, etc., highlighting the communicative advantages of the system. Genre preference is often a form of sociocultural identity, making it an important aspect of human existence.

Genre preference involves an individual’s relationship to the familiar and the unfamiliar. Musical training/expertise strongly affects a listener’s reaction to unfamiliar music. The more expertise an individual has, the more likely they are to enjoy unfamiliar music, probably thanks to better-rehearsed top-down processing. Music novices are better able to perceive elements of familiar music, which adds to music enjoyment.

Many publications, critics, theorists, and analysts now decry the obsession with genre in popular music award shows. The Music Genome Project is an initiative by the founders of Pandora Radio that automates playlist generation by way of seed association. This means the listener chooses a song, album, artist, or genre which is then examined for attribute keywords that the project considers genre-definers. This is achieved with the help of a team of analysts assigning attributes to individual songs. While the exact list is a trade secret, you can view an approximation of the 450 [1] attributes listed here alphabetically, and here by type. For the purposes of this article, one thing stands out:

The vast majority of attributes in the Music Genome Project describe melodic elements.

Regardless of how much we talk about the importance of form and rhythm in music, what really defines how a person receives and associates with music is its pitched content. While rhythmic cadences are vital in genre definition, attributes relating to vocals, lyrics, pace/complexity of harmonic progression, instrumentation, timbre of instrumentation (including things like guitar distortion levels), and so on dominate the list.
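Since the real attribute list and matching method are trade secrets, the following is only a guess at the general shape of seed association, not Pandora's actual algorithm. It tags a few made-up songs with made-up attributes and ranks them by how much their attribute sets overlap with the seed (Jaccard similarity). Every name in it is hypothetical.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of attributes two songs have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical catalog: song title -> analyst-assigned attributes.
CATALOG: Dict[str, Set[str]] = {
    "Song A": {"distorted guitar", "funk bass", "male vocals", "syncopated rhythm"},
    "Song B": {"distorted guitar", "funk bass", "male vocals", "odd meter"},
    "Song C": {"acoustic guitar", "female vocals", "slow harmonic rhythm"},
    "Song D": {"distorted guitar", "male vocals", "major key"},
}


def playlist_from_seed(seed: str, catalog: Dict[str, Set[str]], size: int = 2) -> List[str]:
    """Return the `size` songs whose attributes overlap most with the seed."""
    seed_attributes = catalog[seed]
    candidates = [title for title in catalog if title != seed]
    ranked = sorted(
        candidates,
        key=lambda title: jaccard(seed_attributes, catalog[title]),
        reverse=True,
    )
    return ranked[:size]


print(playlist_from_seed("Song A", CATALOG))
```

Here "Song B" shares three of five combined attributes with the seed and "Song D" shares two of five, so they come out on top, while "Song C" shares nothing and is left out.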

The direct mechanical effects of rhythmic auditory stimuli (such as on the motor cortex or cardiovascular system) do not seem to exist with regard to most pitched and harmonic information, with the exception of low frequencies that register in part as perceptually rhythmic. To sum up, melodic/harmonic pitch perception:

  1. Takes longer to develop in humans
  2. Relies more on learned environmental patterns
  3. Displays greater diversity across cultures
  4. Promotes greater subjectivity in the listener

How and why these aspects all tie in together is grounds for exciting research. It also suggests an explanation for why the pentatonic scale in particular is so universal. A fundamental with harmonic overtones matching this scale will be found in naturally occurring tones more often than other spectral architectures, which means that these expectant neuron clusters will form in humans regardless of background. Whether there is a deeper or more innate framework for this other than world experience growing up is not yet known.


Musical structure is memory. A list of musical movements/genres is also largely a list of popular structures throughout history. Song structure lends itself quite easily to analysis, which is why I’m not going to rehash it too closely here. The most important thing to know is that composers use regular returns to musical events to ground listeners in familiar territory before deviating again into new content.

Most of the following is paraphrased or directly quoted from Bob Snyder’s excellent book, Music & Memory: An Introduction.

Memory is the ability of neurons to alter the strength and number of their connections to each other in ways that extend over time. Memory consists of three processes: echoic memory and early processing; short-term memory; and long-term memory. This modern hierarchical concept of memory aligns with Stockhausen’s three-level model of timefields. Each of these three memory processes functions on a different time scale, which Snyder then relates to listening as “three levels of musical experience.”

  1. Event fusion level (echoic memory/early processing)
  2. Melodic and rhythmic level (short-term memory)
  3. Formal structure level (long-term memory)

The initial echoic memory sensations decay in less than a second. This experience is not analyzed at this level, but rather exists as a raw, continuous stream of sensory data. Our friend Dr. Bharucha helped define the specialized groups of neurons that extract some acoustic data from this continuous stream, a process called feature extraction. Such features include pitch, overtone structure, and presence of frequency slides. These features are then bound together as coherent auditory events. Unlike echoic memory, this information is no longer a continuous barrage, meaning the amount of data is greatly reduced. Together, feature extraction and perceptual binding constitute perceptual categorization.

Snyder’s memory model of auditory perception.

After extracted features are bound into events, the information is organized into groupings based on feature similarity and temporal proximity. These can activate long-term memories called conceptual categories. Such memories consist of content not usually in conscious awareness, which must be retrieved from the unconscious. This can take place either in a spontaneous way (recognizing and reminding) or as the result of a conscious effort (active recollecting). However, even when recalled, these memories often remain unconscious and instead form a context for a listener’s current awareness state. This is called semiactivated, meaning they’re neurologically active and can affect consciousness (emotional state, expectation, decision-making, etc.) but are not actually the present focus of cognitive awareness.

If information from a long-term memory becomes fully activated, it becomes the focus of conscious awareness, allowing it to persist as a current short-term memory. If not displaced by new information, these can be held for an average of 3-5 seconds in typical humans. After this time window, it must be repeated or rehearsed internally, i.e. consciously kept/brought back into focus, or it will decay back to long-term memory. The more striking the information in question, the more likely it is to more permanently affect this system by creating new long-term memory information.

There is a constant functional interchange between long- and short- term memory. This is the basis of formal structure in music.

Pitch information is extracted in auditory events that take place in less than 50 milliseconds. This races by as part of the data stream which is not processed consciously.

Events farther apart than 63 milliseconds (16 events per second) constitute the aforementioned melodic and rhythmic level of musical experience. Since these occur within the 3-5 second window of short-term memory, we consider separate events on this timescale as a grouped unit that occur in the present. This time window is essentially a snapshot of consciously perceived time. We parse this musical perception level in two dimensions: melodic grouping according to range similarity, rising/falling motion, and reversals of that motion; and rhythmic grouping according to timing and intensity. Perception information events received within this window are considered by the brain to be available all at once.

Event intervals lasting longer than 5 seconds (roughly, depending on the individual and expertise/training) fall into the category of formal structure. Here, our expectations are manipulated to allow auditory events to fall into unconscious long-term memory. This manipulation activates our limbic reward system, and that feeling is stronger in music we find familiar.
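The three time scales above can be summarized as a simple lookup. The boundaries here are the approximate figures quoted in the text (16 events per second, and the upper end of the 3-5 second window), hard-coded as sharp cutoffs purely for illustration. In a real listener they are fuzzy and vary from person to person.

```python
EVENT_FUSION_LIMIT_S = 1 / 16   # about 63 ms between events
SHORT_TERM_LIMIT_S = 5.0        # upper end of the 3-5 second window


def experience_level(interval_s: float) -> str:
    """Map the time between events to one of the three levels of musical experience."""
    if interval_s < EVENT_FUSION_LIMIT_S:
        return "event fusion"
    if interval_s <= SHORT_TERM_LIMIT_S:
        return "melodic and rhythmic"
    return "formal structure"


examples = {
    "one cycle of a 440 Hz tone": 1 / 440,
    "16th notes at 120 BPM": 0.125,
    "one bar of 4/4 at 120 BPM": 2.0,
    "a 30-second verse": 30.0,
}

for description, seconds in examples.items():
    print(f"{description:>28}: {experience_level(seconds)}")
```

So the individual cycles of a tone fuse into pitch, notes and bars live in the short-term window, and anything on the scale of a verse has to lean on long-term memory.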

This is how musical structures rely upon the three levels of memory and traditional genre expectations to manipulate the dopaminergic cycle of expectation and reward. Genres/styles/traditions achieve this goal by techniques such as symmetric bar structures, removal and return of repetitive catchy (often sung) melodies/hooks/themes/ostinati, and drastic changes in the volume and presence of vocals and instruments.


Obviously, pitched information is vast and malleable and cannot be truly summarized in a series like this. Many, many books have been written on it, though in terms of cognitive structures much of it remains a mystery. I hope this has at least piqued your interest to some extent. Next will come the final section of this series.

Thanks for reading!
