“Trust Me, I Researched It Online”: Exploring the Bias in Search Engine Results (2022-US-30MP-1131)

Invited Paper Winner

Peter Hersh, JMP Senior Systems Engineer, SAS
Hadley Myers, Sr. Systems Engineer, JMP

When collecting data for an analysis, we are all very cognizant of the need for an unbiased sample that is truly representative of a greater population. Great efforts, often at great expense, are taken to ensure that is the case. However, this standard is not always applied to other forms of data collection. For many, research into topics of interest starts and ends with online searches. Using a designed experiment and the visualization/analytic capabilities of JMP 17, we sought to investigate how different search engines in different parts of the world are potentially biasing search results and, therefore, the conclusions we respectively reach on these topics. Join us for this amusing and thought-provoking presentation that you should totally rate five stars prior to viewing to save time.

Thank you all for clicking on this talk and coming to watch it.

This is really about bias in data.

Every analyst who works on a project understands

the importance of ensuring that the data that they collect is unbiased.

Steps are taken to avoid bias at the start of the data collection,

before the project's even really begun,

there are numerous checks at points along the analysis,

and then at the end, any conclusion reached is taken

in the context of potential additional sources of bias.

But this same level of care isn't applied to online searches on topics of interest.

Search engines use algorithms that are designed to deliver

personalized content that is relevant for us as individuals.

Now, this has advantages.

It means that returned search hits

are more likely to be relevant and interesting,

but it also has disadvantages.

By definition, these are not unbiased.

We have an example.

Yeah, Hadley brought up a great point there,

in science and engineering,

we take great care to make sure that our samples are unbiased.

But let's think of a library.

We walk into the library and there are two people interested

in informing themselves on vaccine safety.

Let's say they walk into a library

and ask a librarian for these books on vaccines,

so the first person receives three books.

These are actual books.

Smallpox: A Vaccine Success,

Anti-vaxxers: How to Challenge the Misinformed Movement,

and Stuck: How Vaccine Rumors Start and Why They Don't Go Away.

Now, let's say a different person walks in

and receives three completely different books.

THE COVID VACCINE: And silencing of our

doctors and scientists,

Jabbed: How the Vaccine Industry,

Medical Establishment, and Government Stick it to You and Your Family,

Anyone Who Tells You Vaccines Are Safe and Effective is Lying.

These are actual book titles.

Let's say that looking at who you are,

so where you live, how old you are, your gender,

maybe even your browser history

determines which of these sets of books you get.

This is essentially some of the problem with the bias

as you go in to search for things online.

It may be that before we even start

looking at our browser search, we've already got bias in there

and we want to understand if that's the case or not.

That's what motivated this.

You got any thoughts on that, Hadley?

Well, I think that the thought that I'd like to express right now

is that the purpose of this presentation isn't to judge or to opine

on the advantages or disadvantages of the search algorithms

that may or may not be used.

The purpose here really is just to take an example

of complex, unstructured data, complex because it is unstructured,

and this was search results.

Then to use some of the exploratory visual

and analytic capabilities found in JMP Pro 17 to try to understand

what we were seeing and to present it in such a way to help you to understand.

The purpose of this presentation is to inspire you to try these techniques

for yourselves and others like them on your own data.

Let's go through briefly the methodology.

What Pete and I did was we came up

with a few search terms we thought would lead to interesting results.

You can see those terms here.

We defined some potential input variables

which may or may not be affecting the results of the search.

We know that there are very likely others as well that we didn't include.

This is true with any designed experiment.

We can't capture every variable, but we took a few.

We'll see whether these are significant or not.

We developed a data collection procedure

whereby we used the MSA design in the DOE menu.

This is a convenient way to create these tables that we can then send

to JMP SEs and friends of SEs.

Now, right away this isn't an unbiased random assortment

of people we've asked to fill out these.

They're all people that work for the same company and have the same job title.

As we said, the purpose is really

to understand the techniques and methods that we use to try to understand the data,

and then to think about how you can apply it yourself.

We explored the results which we'll show you.

Then finally we presented the findings at the JMP Discovery Summit America 2022,

which is what you are watching right now.

Without further ado, let's jump into the data.

I'll start out by talking just briefly about the MSA design that you see here.

What we've done is we've added the factors of interest,

we've added the terms of interest that we were looking at.

Then the nice thing about this is that when we make the table,

what we could always do is press this button,

send these out to everybody that needed to complete the results for us,

have them sent back, concatenate them,

and then we're ready to begin our analysis.

But as any analyst who's ever collected data

and tried to analyze it knows, the data very often isn't in a format

where you can immediately start with your analysis,

some cleaning needs to be done.

I'll pass things over to Pete to talk about that.

Pete?

Yes, great point.

I think everybody has gone through this.

Even with a well-designed DOE,

you oftentimes have to make some adjustments to do the analysis.

Hadley showed those operator worksheets

that came out, and here is one that I filled out.

I'm not going to keep myself anonymous,

but I didn't want to share someone else's results.

But just to give you an idea, we had folks answer

a few demographic questions that hopefully weren't too revealing,

but basically where you were located,

how old you are, and then the search term you used.

Like Hadley showed, there were three responses.

We had people do this search and then show the top three responses

that that search engine recommended.

To do the analysis,

the first thing we had to do is just take these three and bring them together.

This is a nice easy way to do this,

is to go under the Cols menu, Utilities, and just Combine Columns.

Now I just called these responses and made a little delimiter.

I unchecked that Multiple Response option 'cause we're going to just do

text analytics on this.

Then you get this,

which is the table that we, excuse me, the column we're using for Text Explorer.
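
For readers working outside JMP, the combine step described here can be sketched roughly as below. This is only an illustration in Python, not what Combine Columns does internally; the column names ("Response 1" and so on), the " | " delimiter, and the sample rows are all assumptions made up for the example.

```python
# Minimal sketch: merge three response columns into one delimited text column.
# Column names, delimiter, and sample rows are hypothetical, not from the survey.

def combine_responses(row, columns=("Response 1", "Response 2", "Response 3"),
                      delimiter=" | "):
    """Join the non-empty response fields of one row into a single string."""
    cleaned = (str(row.get(c) or "").strip() for c in columns)
    return delimiter.join(part for part in cleaned if part)

rows = [
    {"Search Term": "the world is", "Response 1": "not enough",
     "Response 2": "yours", "Response 3": "your oyster"},
    {"Search Term": "statistics is", "Response 1": "a science",
     "Response 2": "", "Response 3": "hard"},
]

# Add the combined column that a text analysis would then read.
for row in rows:
    row["Responses"] = combine_responses(row)
```

Empty responses are skipped so that a missing answer does not leave a stray delimiter in the combined text.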

Now, I did this and then I brought in all of the results

from all the different people who took the survey

and tweaked it a little bit more, like combining whether you were in the US,

which state you were in, and then summarized that into region,

between the Americas and Europe,

'cause we didn't have enough respondents to break it up by state.
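
As a rough picture of that recoding, a lookup table that collapses a detailed location into a coarse region might look like the following. The countries listed and the "Unknown" fallback are assumptions for illustration; the talk does not say which locations the respondents actually reported.

```python
# Minimal sketch: recode a detailed location into a coarse region.
# The lookup entries are hypothetical examples, not the survey's real locations.
REGION_BY_COUNTRY = {
    "United States": "Americas",
    "Canada": "Americas",
    "Germany": "Europe",
    "United Kingdom": "Europe",
}

def recode_region(country):
    """Return the region for a country, or 'Unknown' if it is not in the lookup."""
    return REGION_BY_COUNTRY.get(country.strip(), "Unknown")
```

State-level detail is simply dropped here, which mirrors the decision to summarize at the region level when there are too few respondents per state.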

But in the end, we end up with a table that looks like this.

We had to do a little bit of recoding,

we had to do a little bit of filling this in

and then anonymize that search engine.

The folks that got the survey knew which search engine to use,

but we're not sharing that here.

Hadley is going to now talk about some

of the results we saw out of this once we had it in this form.

Let's open up this dashboard right here.

What you're seeing are the most popular terms

in order of popularity from left to right, descending order

for the first response, second response, and third response for every one

of the search terms, for every gender and age and all of the other factors.

We can use this

and this hierarchical filtering on the dashboard to explore

this a little closer, see if we can learn anything.

One thing I happened to notice: if we look

at "the world is" and we click on male, you'll see that for most people,

or for many people, the first result they found was that the world

is not enough. If you're female, you're equally likely to find it is your oyster.

Interestingly, if you're less than 40 years old,

that's when "the world is not enough" suddenly becomes "the world is yours."

I think we could probably agree that it's probably true for people

under 40, isn't it?

What else have we got here?

If I look at climate change,

so another of course hot topic of interest these days,

as well it should be.

If I were to look at people over 50,

apparently a huge concern for them is whether climate change

is changing babies in the womb,

which, interestingly, isn't a concern for people below 40.

I wondered whether

this is a valid concern for people over 50,

whether they're more likely to have their babies change

in their wombs or not.

But aside from that,

let's take a step back and see how we can go about creating this dashboard.

It's quite simple.

The first thing we need to do is to create our filter variables.

I've done that here.

Here's our search terms and our distributions.

What I'll do is I'll go through how to create the graph builder report,

because that's something that you may not be familiar with,

that you might be interested in doing.

I'm going to take my first response,

put it here, and simply choose the number of times those occur.

Then I can right click and order by count descending.

That's it.

I've done the same thing for my second response

and my third response as well.
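
The idea behind each of these graphs, counting how often each response occurs and ordering by count descending, with the dashboard filters narrowing the rows first, can be sketched outside JMP as below. This is an assumption-laden illustration in Python rather than how Graph Builder works; the column names and the rows are made up and are not the actual survey data.

```python
from collections import Counter

def top_responses(rows, column, filters=None):
    """Count values in `column` for rows matching every filter, most frequent first."""
    filters = filters or {}
    matching = [r for r in rows
                if all(r.get(key) == value for key, value in filters.items())]
    return Counter(r[column] for r in matching).most_common()

# Hypothetical rows, for illustration only.
rows = [
    {"Search Term": "the world is", "Gender": "Male", "Response 1": "not enough"},
    {"Search Term": "the world is", "Gender": "Male", "Response 1": "not enough"},
    {"Search Term": "the world is", "Gender": "Male", "Response 1": "yours"},
    {"Search Term": "the world is", "Gender": "Female", "Response 1": "your oyster"},
]

# Equivalent of clicking a search term and a gender in the filter panels.
male_counts = top_responses(
    rows, "Response 1", {"Search Term": "the world is", "Gender": "Male"}
)
```

Passing a different `filters` dictionary plays the role of clicking a different bar in the hierarchical filter.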

Now we can go ahead putting together the pieces of the dashboard.

We'll click on new dashboard,

we'll choose the hierarchical filter plus one.

I'll take my distribution results put them there,

my input parameters, put them there,

and then my graphs.

Let's see, is this one first?

Well, I can't tell.

We'll just put them in order like this,

and I can always change the order if I want.

All right.

I'll run the dashboard,

and there we have it.

It really is as simple as that.

Then I can go ahead and save it to the table.

That was one use of a dashboard.

I'll show you another use of a dashboard,

which was to use it with a Text Explorer Word Cloud.

This is the most common words,

not just entire phrases or entries, but individual words.

You can see the word design seems to be used a lot.

If I were to look at, for example, statistics,

so it looks like everybody can agree that statistics is a science.

Interestingly, if you're in Europe, apparently you find it harder than you do

if you're in America, where that doesn't come up,

so something I happen to notice there.

To create this dashboard, it's very much the same as the other one.

We'll add our distribution items.

Here's the first one, here's the second one.

We'll add our Text Explorer Word Cloud,

and then we'll simply put this one together

just as we did the previous one.

With that, I'd like to thank you

for this part of the presentation about the exploratory visual analysis.

I've shown you how you can go about

doing this using the hierarchical dashboards.

Now I'll turn things back over to Pete,

who will take us through some more in-depth use of the Text Explorer.

Perfect. Thanks, Hadley.

Like Hadley mentioned this is

a different way to display this, but this is the end result of using

the Text Explorer and looking just at the Word Cloud here.

He had made this a dashboard

and used filters that were graphical in nature, which is great.

You could do this also with a local data filter.

But this is basically the end result we're going for.

Let's now back up and talk about how we got here.

With our data set over here, we just launched the Text Explorer

under the Analyze menu,

we put in our column that we're interested in.

In this case, all three responses combined into one column.

We have a bunch of options we can use to tweak this,

including language and how we tokenize the words.

But we're going to go ahead and just use the defaults.

Here you can see since we have different responses

to different search terms,

the overall term and phrase list by itself is not super informative.

What we would want to do is apply that local data filter

and the first thing we'll look at is that search term.

Now we can do something like the economy or coronavirus

or climate change and go from there.

Let's focus in on climate change here.

One thing that I wanted to do was add some sentiment analysis.

The first thing I'm going to do is go ahead

and turn on this Word Cloud so it looks like it had before.

Now we can display it this way where you

have the most common term in there and you can see it's climate and change.

We know that we're searching that,

so I could go in here and add these as stop words

and now see which ones come up

the most frequently when we're mentioning climate change.
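
To make the stop-word step concrete, here is a minimal term-count sketch in Python. It is not Text Explorer's tokenizer (which has its own defaults and options); the regular expression, the function name, and the example documents are assumptions chosen only to show why removing the searched words exposes the more informative terms.

```python
import re
from collections import Counter

def term_counts(documents, stop_words=()):
    """Count lowercase word tokens across documents, skipping any stop words."""
    stop = {word.lower() for word in stop_words}
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z']+", doc.lower())
        counts.update(token for token in tokens if token not in stop)
    return counts

# Hypothetical responses, for illustration only.
docs = [
    "climate change is the greatest threat",
    "climate change is real",
    "is climate change natural",
]

# Treat the search words themselves as stop words, as in the talk.
counts = term_counts(docs, stop_words=["climate", "change"])
```

Without the stop words, "climate" and "change" would tie for the top of the list and crowd out everything else.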

This is one way to display the Word Cloud.

I can also go through here and maybe change this to something

that is a little more appealing to the eye,

but maybe less useful from a quantitative standpoint.

You can always add some arbitrary colors

if you like that as well.

All right, so I've done this to this point,

but now I want to add some sentiment analysis to this.

Are people thinking climate change

is natural, a good thing or is it a bad thing?

You can see some things in here that maybe indicate that,

but I wasn't quite sure where to find sentiment analysis.

With JMP 17, we have this new feature called Search JMP.

If you're ever looking for an analysis in JMP,

this is a great way to find that.

If I just start typing in sentiment,

you can see right here that it tells me how to find this,

I can do the help, but I can also just hit show me and it launches it right there.

If I'm ever wondering, hey, how can I do this, this gives me the option.

Now, a couple of things you see here

it's identified some of these default terms

that are providing sentiment.

Things like good.

If I click on good, I get a little summary.

It looks like when people are saying good,

that is actually a positive sentiment.

Now, what about greatest?

Oh, boy, almost everything that says greatest is a greatest threat.

Maybe that's not actually a positive sentiment there.

We might need to do a little bit of tweaking.

First let's go in here and say, okay, well, greatest threat is a phrase

that we're seeing commonly.

I'm going to just add that phrase.

Again, you would do this in your curation process,

and now you see that that goes away.

But I think greatest threat is actually a negative thing.

Let's look at those sentiment terms.

You can see JMP's identified that as something that maybe has sentiment.

I'm going to just say, you know what?

That's a really negative sentiment.

Now when we go down here,

you can see that it's flagged those seven occurrences

where they mentioned greatest threat,

and it said that those are very negative.

That's changed our overall impression of, do most of these search terms

think this is negative or positive?

That's just an example of how you can

walk through that flow and come up with the end sentiment analysis.
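
The "greatest threat" fix can be pictured with a small lexicon-based scorer in which multi-word phrases are matched before single words. This is a simplified sketch and not JMP's Sentiment Analysis algorithm; the lexicon entries and the numeric scores are arbitrary assumptions chosen for illustration.

```python
import re

# Hypothetical lexicons; the scores are arbitrary illustration values.
TERM_SCORES = {"good": 60, "greatest": 90, "bad": -60}
PHRASE_SCORES = {"greatest threat": -90}

def score_document(text, term_scores=TERM_SCORES, phrase_scores=PHRASE_SCORES):
    """Sum sentiment scores, consuming multi-word phrases before single terms."""
    remaining = text.lower()
    total = 0
    for phrase, score in phrase_scores.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        # Remove each matched phrase so its words are not scored a second time.
        remaining, hits = re.subn(pattern, " ", remaining)
        total += score * hits
    for token in re.findall(r"[a-z']+", remaining):
        total += term_scores.get(token, 0)
    return total

sentence = "Climate change is the greatest threat"
with_phrase = score_document(sentence)
without_phrase = score_document(sentence, phrase_scores={})
```

With the phrase registered, the sentence scores as negative; without it, the lone word "greatest" makes the same sentence look positive, which is the problem seen in the demo.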

I'm going to pass it back over to Hadley

and let him wrap things up here.

What I'd like to say is that we showed you,

first of all, how we went about using

the MSA Design to help with the data collection.

We used Recode and other items

in the Tables menu to help with the data clean up.

We then used Distribution, Graph Builder,

Text Explorer, and combinations of all of them together to help

with the data exploration,

see if we can uncover anything interesting.

Then Pete used Sentiment Analysis together with Search JMP in JMP 17

to see what else we can learn about the data as a whole.

With that, I hope you found this useful.

I hope it's given you some ideas about how

you can do this on your own data for yourselves.

I'd like to thank you all for listening,

and I hope you enjoy the rest of the JMP Discovery Conference.

Thank you.

Phil_Kay · ‎09-26-2022

Really fun and thought-provoking. And a good way to highlight some useful JMP and JMP Pro features. What a great idea!

Peter_Hersh · ‎09-26-2022

Thanks Phil