Level II

Design a Digital Music Melody Hearing Test (2022-EU-30MP-989)

Charles Chen, Advisor, Applied Materials
Mason Chen, Student, STEAMS
Patrick Giuliano, Advisor, STEAMS


This presentation showcases the design of a special music hearing test to assess a musician’s ability to hear melodies. The Definitive Screening Design (DSD) platform in JMP was used to study six music script input variables (step, speed, notes changed, note level, repeat, difficulty), and two additional center points were added to evaluate Gage R&R performance. Each DSD run is a multiple-choice question in which respondents pick their response from four available choices.


The JMP Hierarchical Clustering platform was used to group similar music scripts among the 20 scripts produced by the DSD runs, and similar scripts were assigned as the three incorrect choices for each question. Placing the correct choice among these similar distractors makes each hearing question more challenging. Next, a hybrid stratified and cluster sampling method was adopted to select 30 candidates to participate in the survey. Once the scripts were determined, a commercial music synthesizer software program was used to create the DSD melody hearing test. After the survey results were collected, the Fit Definitive Screening platform in JMP was used to analyze them. The goal was to determine the best rater (the one with the highest propensity for accurately rating musical melodies) to serve as the judge for the next project phase.




All right.

Well, thanks, everyone, for joining us.

The title of our project

is Design a Digital Music Melody Hearing Test.

I'm Patrick Giuliano,

and my co-presenters are Charles Chen and Mason Chen,

who couldn't be here today.

So I'm going to be presenting on their behalf.

And this is a project,

a high-school STEM project inspired by the STEAMS methodology,

which is basically STEM but with AI, math, and statistics well integrated.

Okay, so just to introduce this project

in project management terms, with the project charter.

The purpose of the project, in effect, is to design a test

to test the hearing capability of a musician.

The experimental design methodology we use

is JMP's powerful Definitive Screening Design, or DSD, capability.

And we designed the test based on six music melody variables

in order to test hearing capability,

where each question starts with a short melody

followed by four choices, only one of which repeats that melody,

and the other three melodies are similar but not identical.

From this test, each listener has to pick their best choice

among the options available.

Once we designed this test, we analyzed the survey results.

We build a sensitivity model

in consideration of six music hearing variables,

and then screen the listeners to determine which ones performed the best

in the music hearing test.

And in doing so, in the screening process,

we analyze the strengths and weaknesses of their hearing capability

with the ultimate goal of creating an orchestra

evaluated by a panel of highly capable listeners.

Okay, so in the service of science,

we have an introduction to the mechanism of hearing

where the ear is basically a frequency-receiving apparatus

that collects sound, which vibrates the ossicles in the middle ear,

and that mechanical vibration is converted

into an electrical signal, which is carried by the auditory nerve

and ultimately interpreted by the brain.

All right, so before we get into the experiment

and the variables that we analyzed,

let's talk a little bit about the frequency range of hearing

among individuals depending on their age.

So people of all ages without hearing impairment

should be able to hear at a frequency of approximately 8,000 Hertz,

and gradual loss of sensitivity to higher frequencies with age

is a normal occurrence.

And so what the science tells us

is that the auditory structures of younger people

are typically more capable of absorbing or interpreting

higher-frequency sounds, which is, of course,

relevant in terms of which instruments people are playing,

where the violin has a higher pitch than the cello,

so perhaps a younger person might be more suited

for playing the violin than an older person.

And so this gives you an idea that

people in their fifties may only be able to hear

up to about 12 kilohertz, or 12,000 Hertz,

whereas people in their 20s can hear up to perhaps 18 kilohertz.

And just to give some context, the typical frequency range

of the sounds that we hear most often every day

is between 250 Hertz and 6,000 Hertz.

Okay, so what are some challenges associated with hearing

in the context of sounds of different frequency?

So people typically miss high frequency sounds

more often than low frequency ones.

And people with high frequency hearing loss,

they have trouble hearing higher-pitched sounds, of course, right?

And higher-pitched sounds often come from women or children

and are in the upper 2 to 8 kilohertz range.

And what's also typical

with high frequency hearing loss in many people

is the presence of a phantom sound, which is the condition called tinnitus,

and that competing sensation of sound can also inhibit a person's ability

to distinguish other high frequency sounds.

So clearly, age is an important factor in terms of designing

an effective hearing test and developing an effective panel of listeners

who are attuned to music.

Although we didn't explicitly consider age in our experiment,

as you'll see in the subsequent slides,

it definitely could be a factor that we could explore further

in our sampling strategy in terms of the survey respondents that we choose.

Okay, so the basic measure of hearing performance

is called an audiogram.

And what you see in the graph on the right

is just a plot of hearing threshold level in decibels on the vertical axis

versus frequency on the horizontal.

And you can clearly see that as hearing loss progresses,

the threshold level of sound in decibels starts to increase,

and the degradation in performance is shown by the curves

for successive years moving down and to the right.

That's the trajectory of the lines connecting the points,

moving down and to the right.

Okay, so just a little bit more background

before we launch into the design of the survey and the analysis.

The intent here is just to emphasize

that frequency interference can be a problem

in producing a melodious harmony in an orchestra in particular

or in any sort of musical composition.

And what we're basically showing here

is the difference between what are called fundamental frequencies and harmonics

in the context of a piano,

at least at the note scale indicated at the bottom.

Okay, so what do we know about the music note frequency spectrum?

Well, not surprisingly, based on the introduction so far,

each note has a particular frequency.

As an example, middle C is at around 262 Hertz,

and higher notes, of course, are going to have higher frequency

and lower notes have lower frequency.

And this slide just gives you a context

for what frequency the notes correspond to.

So note A, at 440 Hertz, is much higher

than note C at about 261 Hertz, in the second set on the right,

in the lower portion of the slide.

Okay, so there's a relationship between frequency and note number.

Frequency doubles every 12 notes,

and we have 12 notes in each octave, seven white and five black.

And so you can see that frequency follows this relationship

as a function of the note number n: it's an exponential relationship,

where each note up multiplies the frequency by the twelfth root of 2.
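
As a sketch of this relationship, assume the standard 12-tone equal-temperament convention for an 88-key piano, with A4 = 440 Hz as key 49. The slide's own formula is not reproduced here, and `piano_key_frequency` is a hypothetical helper. Each key up multiplies the frequency by 2^(1/12), so moving 12 keys up doubles it:

```python
def piano_key_frequency(n: int, a4_hz: float = 440.0) -> float:
    """Equal-temperament frequency (Hz) of key n on an 88-key piano, with A4 as key 49."""
    # Each semitone step multiplies frequency by 2 ** (1/12); 12 steps double it.
    return a4_hz * 2 ** ((n - 49) / 12)
```

For example, `piano_key_frequency(40)` gives about 261.63 Hz for middle C. `piano_key_frequency(61)` gives 880 Hz, one octave above A4.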

All right, so taking us back now to the project

and the implementation and the analysis.

So the project plan has three phases.

The first phase is what I'm going to cover;

it's the analysis that I'm going to discuss today.

The first phase is effectively the process of identifying

which people are the best hearing performers, based on the results

of the survey that we designed and sent out.

In the second phase, the best hearing performers

from the survey results serve as judges.

In this phase, basically, we lay the groundwork for forming the orchestra

prior to phase three, where we actually form it.

But in this instance, we're thinking about things

like which instruments have any potential limitation.

And we may give the same melody to different test instruments,

and not every instrument can play every melody, obviously.

And so the idea is,

how do we know that the individuals that are playing

are playing these instruments accurately?

Well, we need judges who have good listening capability.

So the judges that we curate from phase one

will provide that excellent evaluation in phase two.

So once we have that in place, in phase three,

we can actually really form the digital orchestra.

And we'll think about things like how many players should be involved,

who should play where, obviously.

We'll have a good understanding of how the melodies could be difficult

for certain instruments.

And this is why we need phase two in the middle.

Okay, so here's our survey question design.

So we've identified six variables for this hearing test

related to musical parameters:

step, speed, notes changed, notes level, a repeat variable,

and a difficulty variable, a categorical variable, easy or difficult.

The experiment, as I mentioned before, uses JMP's DSD:

we generate a default DSD,

and then, in effect, we augment the design

by adding two more center points.

So we start with an 18-run DSD,

which includes one center point, row number three in this table,

indicated with a zero and an arrow highlighting row three.

And then we're adding two more center points

at row 10 and row 20 respectively.
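
JMP constructs DSDs internally. A widely used construction, sketched below as an illustration rather than as JMP's actual algorithm, folds over a conference matrix C. A conference matrix is square, with zeros on the diagonal, ±1 elsewhere, and C^T C = (n - 1)I. The construction stacks C, then -C, then one all-zero center run. The Paley construction used here for the order-6 conference matrix is an assumption for illustration, as are the function names. The result is a minimal 13-run design for six continuous factors, not the talk's 18-run design with a categorical factor.

```python
import numpy as np


def quadratic_character(a: int, q: int) -> int:
    """Legendre symbol of a modulo an odd prime q: 0, +1 (nonzero square) or -1."""
    a %= q
    if a == 0:
        return 0
    return 1 if pow(a, (q - 1) // 2, q) == 1 else -1


def paley_conference_matrix(q: int) -> np.ndarray:
    """Symmetric conference matrix of order q + 1, for a prime q with q % 4 == 1."""
    jacobsthal = np.array(
        [[quadratic_character(j - i, q) for j in range(q)] for i in range(q)]
    )
    C = np.zeros((q + 1, q + 1), dtype=int)
    C[0, 1:] = 1          # first row: 0, 1, 1, ..., 1
    C[1:, 0] = 1          # first column: 0, 1, 1, ..., 1
    C[1:, 1:] = jacobsthal
    return C


def definitive_screening_design(C: np.ndarray) -> np.ndarray:
    """Fold-over construction: the rows of C, then the rows of -C, then one center run."""
    center = np.zeros((1, C.shape[1]), dtype=int)
    return np.vstack([C, -C, center])


C = paley_conference_matrix(5)      # 6 x 6 conference matrix -> six factors
D = definitive_screening_design(C)  # 13 runs x 6 factors, levels -1 / 0 / +1
```

Each column takes the levels -1, 0, and +1, and the main-effect columns are mutually orthogonal (D^T D = 10I). The fold-over also makes every main effect orthogonal to every squared term.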

And the idea behind placing these center points

is we want to get an idea of how consistent the results are

throughout the experiment.

So we try to put a center point roughly at the beginning, in the middle,

and the end of the experiment.

And this is analogous to checking whether a measurement process is stable

over time, if you're in a manufacturing environment.

And then the other important thing about our design here

is that we're randomizing the test sequence,

and that's something that we can do in JMP through the generation of the design.

And I'll show a little bit about that briefly

when we come to the next few slides.

And that randomization is really important because it helps eliminate any bias

due to factors that aren't in the experiment

when we run the test.

And that bias is referred to sometimes as lurking variation

or variation due to lurking variables.
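
The run-order randomization and the spread-out center points can be sketched as follows. The code shuffles the runs, then inserts extra all-zero center runs at chosen 1-based positions, for example the middle and the end of the sequence. As the talk notes, JMP can randomize run order when it generates the design. The `randomize_with_center_points` helper and the hypothetical 13-run demonstration design are assumptions for illustration.

```python
import numpy as np


def six_factor_dsd():
    """Hypothetical 13-run, six-factor DSD: order-6 conference matrix, its fold-over, one center run."""
    C = np.array([
        [0, 1, 1, 1, 1, 1],
        [1, 0, 1, -1, -1, 1],
        [1, 1, 0, 1, -1, -1],
        [1, -1, 1, 0, 1, -1],
        [1, -1, -1, 1, 0, 1],
        [1, 1, -1, -1, 1, 0],
    ])
    return np.vstack([C, -C, np.zeros((1, 6), dtype=int)])


def randomize_with_center_points(design, center_positions, seed=None):
    """Shuffle the run order, then insert extra center runs at the given 1-based positions."""
    rng = np.random.default_rng(seed)
    shuffled = design[rng.permutation(len(design))]
    runs = [list(row) for row in shuffled]
    # Insert in ascending order so each position refers to the final run sequence.
    for pos in sorted(center_positions):
        runs.insert(pos - 1, [0] * design.shape[1])
    return np.array(runs)


plan = randomize_with_center_points(six_factor_dsd(), [8, 15], seed=42)
```

For the talk's 20-run table, `center_positions=[10, 20]` would place the two added center runs at rows 10 and 20.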

Okay, so there's another consideration that I touched on.

It's in the context of randomization, but in a slightly different context,

one that is more specific

to this particular application and experiment.

And so basically, what we did is generate an initial random variable

that assigns each question a random value of one, two, three, or four.

But we recoded that value.

So we labeled one A, two B, three C, and four D,

and that's what we see

in terms of identifying the correct answer.

So the two columns at the right of this 20-row table

identify what the correct answer should be

in terms of the letter,

which is derived from the random value of one, two, three, or four.

And we're doing this to ensure a uniform distribution

of the location variable.

And basically what that means in practical terms is that A, B, C, and D

all have an equal probability of being the correct answer.

And this is to avoid the biasing situation

where a student may pick the same answer over and over again

in order to possibly increase his or her chances of performing well,

or perhaps because the survey respondent isn't paying attention

or isn't engaged in the survey.
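
The talk draws a random value from 1 to 4 for each question and recodes it to a letter. A purely random draw makes each letter equally likely, but it does not guarantee equal counts over 20 questions. The sketch below instead shuffles a balanced list, which is a swapped-in variant using a hypothetical `answer_key` helper. With 20 questions, each of A, B, C, and D is then the correct answer exactly five times.

```python
import random
from collections import Counter

LETTERS = "ABCD"


def answer_key(n_questions, seed=None):
    """Return (code, letter) pairs for the correct answer of each question.

    Codes 1-4 are drawn as a shuffled, balanced list, so each letter is used
    equally often whenever n_questions is a multiple of 4."""
    rng = random.Random(seed)
    codes = [(i % 4) + 1 for i in range(n_questions)]  # 1, 2, 3, 4, 1, 2, ...
    rng.shuffle(codes)
    return [(code, LETTERS[code - 1]) for code in codes]  # 1->A, 2->B, 3->C, 4->D


key = answer_key(20, seed=7)
letter_counts = Counter(letter for _, letter in key)
```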

All right, so here is where we come to an evaluation

of the performance of the DSD.

We approach this in three ways:

through the statistical power of the experiment,

which is shown in the panel on the left;

through the confounding pattern,

or the extent to which factors are correlated in the experimental design,

which is shown in the panel in the middle;

and through what we call the uniformity of the design,

which is simply what the structure of the design looks like

in a multivariate space.

Have we covered the design space in an approximately uniform way

so that we're able to predict across the entire range of the experiment

with the same degree of precision?

And so, going back over to the left, these panels indicate

that the power for each of the factors in the experiment

is greater than 90 percent, which is good.

And it shows us that we have good sensitivity to detect effects,

if they're actually there in the population.
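
JMP computes power analytically from anticipated coefficients and an assumed error standard deviation. As an assumption-laden stand-in, the sketch below estimates power by simulation for one main effect in a main-effects model. It uses a hypothetical 13-run, all-continuous six-factor DSD rather than the talk's 20-run design, so its numbers will not reproduce the reported values above 90 percent. The critical value comes from a simulated null distribution, so only NumPy is needed. The `simulated_power` function, the effect sizes, and sigma = 1 are illustrative.

```python
import numpy as np

# Hypothetical 13-run, six-factor DSD (folded order-6 conference matrix plus one center run).
C = np.array([
    [0, 1, 1, 1, 1, 1],
    [1, 0, 1, -1, -1, 1],
    [1, 1, 0, 1, -1, -1],
    [1, -1, 1, 0, 1, -1],
    [1, -1, -1, 1, 0, 1],
    [1, 1, -1, -1, 1, 0],
])
DESIGN = np.vstack([C, -C, np.zeros((1, 6), dtype=int)])


def t_stats(X, Y, j):
    """OLS t statistic for coefficient j, for each column of Y (one simulated experiment per column)."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    B = xtx_inv @ X.T @ Y
    resid = Y - X @ B
    s2 = (resid ** 2).sum(axis=0) / (n - p)  # residual mean square, n - p degrees of freedom
    return B[j] / np.sqrt(s2 * xtx_inv[j, j])


def simulated_power(design, beta, sigma=1.0, alpha=0.05, n_sim=20000, seed=1):
    """Monte Carlo power of a two-sided test for a main effect of size beta on the
    first factor, fitting an intercept-plus-main-effects model."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(len(design)), design])
    n, j = len(X), 1  # column 1 is the first factor
    # Critical |t| from a simulated null distribution (avoids needing scipy).
    null_t = np.abs(t_stats(X, rng.normal(0.0, sigma, (n, n_sim)), j))
    critical = np.quantile(null_t, 1 - alpha)
    coef = np.zeros(X.shape[1])
    coef[j] = beta
    Y = (X @ coef)[:, None] + rng.normal(0.0, sigma, (n, n_sim))
    return float(np.mean(np.abs(t_stats(X, Y, j)) > critical))
```

Larger effects relative to sigma, or more runs, push the estimated power toward 1. With beta = 0 the estimate should sit near alpha.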

The panel in the middle shows

that the risk of what we call multicollinearity

or excessive correlation among the experimental factors is low,

because most of the pairwise correlations in this correlation matrix are blue,

where bluer cells correspond to lower correlation

and solid blue indicates zero correlation.

And the squares that are closer to a red shading

indicate a higher extent of correlation among factors or terms in the experiment.

And so overall, we look for correlations

that don't exceed 0.3,

and that holds for all the squares in this plot

with the exception of those slightly reddish squares

where the correlation is a little bit higher.

And that's because we have categorical factors, right?

We have at least one categorical factor in this experiment.

And if we didn't have the presence of a categorical factor,

this plot would look even bluer.

So we say that in DSD, we don't recommend

adding too many categorical variables into the experiment,

because if we do, then we increase this correlation problem,

which reduces the precision of the estimates in our model

by inflating their variance.
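
A rough numerical analogue of the color map is sketched below. It assumes a hypothetical 13-run, all-continuous six-factor DSD (folded conference matrix plus one center run) rather than the talk's design with a categorical factor. It looks only at main-effect and pure-quadratic columns, whereas JMP's color map also includes two-factor interactions. Any pair of terms whose absolute correlation exceeds the 0.3 guideline is flagged. In this design the main effects are uncorrelated with each other and with the quadratic terms. Each pair of quadratic terms has correlation 2/15 (about 0.13), so nothing is flagged.

```python
import numpy as np

# Hypothetical 13-run, six-factor DSD: order-6 conference matrix, its fold-over, one center run.
C = np.array([
    [0, 1, 1, 1, 1, 1],
    [1, 0, 1, -1, -1, 1],
    [1, 1, 0, 1, -1, -1],
    [1, -1, 1, 0, 1, -1],
    [1, -1, -1, 1, 0, 1],
    [1, 1, -1, -1, 1, 0],
])
D = np.vstack([C, -C, np.zeros((1, 6), dtype=int)]).astype(float)


def term_columns(design):
    """Main-effect columns X1..Xm followed by pure-quadratic columns X1^2..Xm^2."""
    m = design.shape[1]
    names = [f"X{i + 1}" for i in range(m)] + [f"X{i + 1}^2" for i in range(m)]
    return names, np.hstack([design, design ** 2])


def flag_correlations(design, threshold=0.3):
    """Correlation matrix of the terms, plus every pair with |r| above the threshold."""
    names, terms = term_columns(design)
    R = np.corrcoef(terms, rowvar=False)
    flagged = [
        (names[a], names[b], round(float(R[a, b]), 3))
        for a in range(len(names))
        for b in range(a + 1, len(names))
        if abs(R[a, b]) > threshold
    ]
    return R, flagged


R, flagged = flag_correlations(D)
```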

And the final plot, on the far right,

which is an indication of the uniformity of this design,

is a scatter plot matrix in JMP,

and it shows each variable versus every other variable.

And what we're looking for

is for white space to be minimal in this plot.

What I've drawn is a little circle here which your eye can easily pick out.

There's a little bit of extra white space there

at the intersection of Repeat and Step.

And that, again, is because we have a categorical variable in our experiment.

And so truthfully,

there's no perfect zero in the main effects,

no true center point in the main effects

due to the presence of that difficulty variable,

the categorical variable.

And that's reflected in the slightly non-symmetric pattern

of the scatter plot matrix on the right,

where that asymmetry shows up as the white space

inside the circle that I've drawn.
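
The white-space check can also be quantified. For each pair of factors, count how many of the nine (low, center, high) level combinations actually appear. The sketch below does this for a hypothetical 13-run, all-continuous six-factor DSD built by folding over a conference matrix, not the 18-run design with a categorical factor from the talk. In this continuous case every pair shows all nine combinations. A gap like the circled Repeat-by-Step region would show up as a pair with fewer than nine.

```python
import itertools

import numpy as np

# Hypothetical 13-run, six-factor DSD: order-6 conference matrix, its fold-over, one center run.
C = np.array([
    [0, 1, 1, 1, 1, 1],
    [1, 0, 1, -1, -1, 1],
    [1, 1, 0, 1, -1, -1],
    [1, -1, 1, 0, 1, -1],
    [1, -1, -1, 1, 0, 1],
    [1, 1, -1, -1, 1, 0],
])
D = np.vstack([C, -C, np.zeros((1, 6), dtype=int)])


def pairwise_coverage(design):
    """For every pair of factor columns, count the distinct (level_i, level_j) combinations."""
    coverage = {}
    for i, j in itertools.combinations(range(design.shape[1]), 2):
        combos = {(int(a), int(b)) for a, b in zip(design[:, i], design[:, j])}
        coverage[(i, j)] = len(combos)  # 9 means every 3 x 3 cell of the panel is filled
    return coverage


coverage = pairwise_coverage(D)
```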

Okay, so before I discuss this slide, I just want to quickly show you

how I got to these design diagnostics.

So what you're seeing here is the table that I just showed you.

And I've generated this design using the DSD platform under the DOE menu,

under Definitive Screening and Definitive Screening Design.

And after I complete the design table generation process

and fill in the results, JMP generates a DOE Dialog script,

saves it to the data table,

and I can actually relaunch the DOE dialog,

and I can also evaluate the design.

So I'm going to go ahead and quickly click on Design Evaluation.

And this is just an overview of the design.

And right here under Design Evaluation

is where I get the diagnostics related to power,

which I showed you in the left panel on that slide,

and the diagnostics that indicate

the extent to which factors or terms in the experiment are correlated.

And that's shown here in the Color Map On Correlations.

And to generate the plot,

looking at the uniformity among the factors,

I actually have to go in and do that in the Scatterplot Matrix platform

under the Graph menu.

So that's just some context for you.

And now, I'm just going to quickly bring up the next slide

and then come back to JMP here

just to dynamically show you what we're doing.

So here's probably the most interesting part

of this experiment.

How do we increase the survey test difficulty

and do it in a smart way?

Well, we can use hierarchical clustering analysis to do that.

Now, we already know the correct answer.

It's indicated here in the Choice column,

and the four columns on the right

indicate which of the 20 melodies

is used for each answer choice.

So we know, for example, in the first row,

the correct answer is C, which corresponds to melody one,

since the ID number under C is one.

So we already know the correct answer

which we've assigned in terms of row order based on a random number,

but how do we pick the other three answers?

Well, based on hierarchical clustering,

we can get a sense of how close each of the other melodies is

to the correct answer.

And in this way, we can make the test a little more difficult.

So all the answer choices are from the 20 melodies.

How do we pick the closer melodies for each question,

or the closest melodies, if you will,

or even melodies that are relatively close together

based on the clustering criterion, without honoring that criterion strictly?


So this might seem a little bit nebulous,

but in effect, all we're really doing

is telling JMP to assign a clustering scheme by row

and based on some clustering criterion that we specify.

And by default, that criterion is Ward.

So I'm just going to show that dynamically here.

So I have the table open.

All I did here is run Hierarchical Cluster under the Clustering menu.

And once I ran this, I went ahead and invoked Cluster Summaries,

which I turned on here.

And then watch what happens when I click on each of these clusters.

These are the clusters.

So rows seven and 18 are associated with each other,

as are 14 and 17, eight and nine, two and 13, and so on.

So this is the idea.

We're using the power of JMP to identify rows

that are associated with each other.

And in this way, by choosing answer choices

that are relatively close to each other, following some scheme like this,

we make the test more difficult.
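
JMP's Hierarchical Cluster platform does this for you, with Ward as the default criterion mentioned above. As a dependency-free sketch, the functions below run a naive Ward agglomeration on the rows of a melody-by-factor matrix, then pick distractors. The distractor rule prefers melodies in the correct answer's cluster and breaks ties by Euclidean distance. That rule, the function names, and the toy data are assumptions for illustration. In practice you would likely standardize the six factor columns first, and the talk notes the clustering criterion was not applied strictly.

```python
import numpy as np


def ward_clusters(X, k):
    """Naive Ward agglomeration: repeatedly merge the pair of clusters whose merge
    adds the least within-cluster sum of squares, until k clusters remain."""
    X = np.asarray(X, dtype=float)
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                na, nb = len(clusters[a]), len(clusters[b])
                diff = X[clusters[a]].mean(axis=0) - X[clusters[b]].mean(axis=0)
                cost = na * nb / (na + nb) * float(diff @ diff)  # increase in within-cluster SS
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    labels = np.empty(len(X), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels


def pick_distractors(X, labels, correct, n=3):
    """Choose n other melodies: same cluster as the correct one first, then nearest by distance."""
    X = np.asarray(X, dtype=float)
    dist = np.linalg.norm(X - X[correct], axis=1)
    others = [i for i in range(len(X)) if i != correct]
    others.sort(key=lambda i: (labels[i] != labels[correct], dist[i]))
    return others[:n]
```

On six toy points forming three tight pairs, `ward_clusters(points, 3)` recovers the pairs. `pick_distractors` then returns the partner melody first, followed by the nearest melodies from other clusters.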

All right, so let me launch back into the slides here.


All right, so basically the last step in completing this experiment

is that, in addition to using a passive criterion for increasing the difficulty of the test,

we want an active criterion.

So we want to be able to separate,

in effect, the beginner level from the advanced level.

So think of it like this.

If every question was super difficult

or if all the choices were very hard to discriminate between,

then you wouldn't be able to distinguish between an advanced-level respondent

and a beginner-level respondent

because everybody would miss all of the questions.

Similarly, if you made all the questions too easy,

then you'd have all experts and no beginners,

and so you have no differentiation.

So based on the science, we have a hypothesis that step and speed

are the most important factors for hearing performance,

that is, for discriminating between a good melody

or a good composition

and a bad one.

So are we sure about that?

Well, one thing we can do is recode step and speed

with a 50 percent reduction

when the difficulty level is Difficult.

And by doing that, in effect, we still have five variables

and those are indicated in the shaded columns, right?

So recoded step and recoded speed are the two shaded columns.

And then we have the notes changed, the notes level, and the repeat.

So the DSD is still orthogonal.

We still have three levels.

We have five variables,

but actually we could incorporate up to six in the DSD.

So how do we get more value, in effect, by increasing the number of variables to six?

Well, we can add the difficulty variable or the categorical variable

which indicates either easy or difficult.

So we decided to use step and speed, combined with these other three variables,

and the total sample size is still 18 plus 2, or 20,

with the one default center point and the two added center points.

But now, we get five levels for speed and step, not three.

So by doing this little transformation,

we smartly create five levels on two variables

instead of just having three levels,

which is typically what we would have in a DSD.

So I think this is a unique approach

that's also quite specific to this problem context

and gives us more levels in our design.
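
The recoding rule can be sketched as follows, assuming coded levels -1, 0, +1 and a Difficulty value of "Easy" or "Difficult". The `recode` helper and the sample runs are hypothetical, not the actual 20-run table. Halving the levels of Difficult questions adds the -0.5 and +0.5 settings, which is where the five levels come from.

```python
def recode(level, difficulty):
    """Halve a coded level (-1, 0, +1) when the question is rated Difficult; otherwise keep it."""
    return 0.5 * level if difficulty == "Difficult" else float(level)


# Hypothetical (coded step, difficulty) pairs, not the actual 20-run table.
runs = [(-1, "Easy"), (0, "Easy"), (1, "Easy"),
        (-1, "Difficult"), (0, "Difficult"), (1, "Difficult")]
recoded_step = [recode(level, difficulty) for level, difficulty in runs]
```

The same rule applies unchanged to Speed.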

Okay, so this is our design.

How are we going to create the...

What software are we going to use to basically generate the hearing tests?

Okay, well, this is just an overview of the music software synthesizer

that we use, a soft synth.

And we utilize it to create

24 multiple choice music melody hearing tests.

It's obviously convenient and portable and fast.

All right, so how do we distribute this survey smartly?

Well, our approach is...

Many people use a single sampling method.

But here, our approach is to integrate several different sampling methodologies:

cluster sampling, stratified sampling, and some additional clustering within,

in order to distribute the survey to the right audience

to make the survey the most useful.

So when you're ready to send out the quiz, how do you do it?

Well, I have some examples here.

Who should play the music?

Well, there are people who know the music and people who don't.

So we only want to send the surveys

to people who are already familiar with the music, right?

Because ultimately, we want to use these people

to evaluate the performance of an orchestra.

In the stratified sampling sense, we have different kinds of instruments.

We may have five students in a particular pool

that know how to play piano,

we may have two that know how to play violin,

and we may want to sample smartly

so that we only pick a certain number within each stratum of players,

that is, people who play particular instruments.

So we may pick randomly within each of these strata

at a certain sampling rate.

And again, with respect to clustering,

we can think of practice location, or geography,

as a way of selecting from many different areas.

In a sense, we cluster and limit our selection

to only the San Francisco Bay Area,

because practicing in person is much easier than practicing virtually.
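
The hybrid scheme described above first restricts the pool to one geographic cluster, then samples within each instrument stratum at a fixed rate. It can be sketched as follows. The candidate records, the field names `region` and `instrument`, the rounding rule, and the at-least-one-per-stratum floor are all hypothetical, and the talk does not specify the actual sampling rates.

```python
import random
from collections import defaultdict


def stratified_cluster_sample(candidates, region, rate, seed=None):
    """Keep only candidates in one region (the cluster), then randomly sample
    each instrument stratum at the given rate (at least one per stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in candidates:
        if person["region"] == region:
            strata[person["instrument"]].append(person)
    chosen = []
    for instrument in sorted(strata):
        members = strata[instrument]
        k = max(1, round(rate * len(members)))  # round() sends exact halves to the even integer
        chosen.extend(rng.sample(members, k))
    return chosen
```

Take a hypothetical pool of five Bay Area pianists, two Bay Area violinists, and three cellists elsewhere. A rate of 0.4 selects two pianists and one violinist.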

Okay, so really the point is that this survey dissemination

and survey data collection processes are very holistic

and increase our chances of producing an effective test set, if you will,

of evaluators to help us form the highest-performing orchestra.

Okay, so quickly, to wrap everything up,

we studied the human hearing frequency range,

the instrument frequency spectrum, the music frequency formula,

and we designed an innovative music melody hearing test using DSD.

We also implemented two interesting approaches

to increase the difficulty of the test, hierarchical clustering,

as well as rescaling the levels

of the most important predictors of the test responses.

And we used the music synthesizer software

to create the hearing test across the six music melody variables.

And in our strategy for dissemination, we use the holistic sampling methodology.

So, in closing, some of the approaches that we use

and the science that we developed could be used to develop a hearing aid,

a music melody hearing aid.

And in the current market, as far as we're aware,

hearing aids are specifically designed for people with hearing loss,

but the idea here would be, how about making a hearing aid

that's about amplifying a certain signal from noise, right?

And that would, in effect,

increase music melody hearing and detection, right?

And so the main objective here

would be to block out noise that's extraneous,

for example, noise from the audience, and then amplify the signal portion

for the particular frequencies that are important

for playing a particular instrument, or even using this type of technology

to even out the pitch, to amplify the transition between melodies.

And so in future work, a similar DSD design can be implemented

in terms of developing this kind of technology.

So thank you very much for listening and let us know if you have any questions.


The authors performed a custom variable transformation on the "Step" and "Speed" factors in the designed experiment conducted for this study (see the attached JMP data table 'Phase I DSD'). The original Step and Speed values were multiplied by 1/2 if the question's Difficulty rating was Difficult, and were left unchanged if the Difficulty rating was Easy.


By introducing two additional levels for each of these factors, we potentially provide greater discrimination between raters on the basis of each input and of the raters' ability to pick the Correct Answer. This transformation also enhances coverage in the middle of the factor space.




But do we need to go from 3 levels (which support up to quadratic terms) to 5 levels (which support higher-order polynomial terms, up to quartic)?


There are two potential concerns: the first is over-fitting, and the second is reduced power when adding more parameters (model terms). If, after model term reduction, the retained terms need no more than 3 levels, then there is no true benefit to introducing the 2 additional levels.
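
A quick way to see the trade-off: a factor observed at k distinct levels can support polynomial terms only up to degree k - 1. The sketch below checks this using the rank of a Vandermonde matrix. The helper `estimable_degree` and the coded settings are hypothetical. The original three levels support at most a quadratic term in Step or Speed. The five recoded levels support up to a quartic term, at the cost of more parameters.

```python
import numpy as np


def estimable_degree(settings, max_degree=4):
    """Highest polynomial degree in one factor that its settings can support:
    rank of the Vandermonde matrix [1, x, x^2, ...] minus one."""
    x = np.asarray(settings, dtype=float)
    V = np.vander(x, N=max_degree + 1, increasing=True)
    return int(np.linalg.matrix_rank(V)) - 1


three_levels = [-1, 0, 1] * 6 + [0, 0]       # hypothetical 20 coded settings
five_levels = [-1, -0.5, 0, 0.5, 1] * 4      # hypothetical recoded settings
```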
