
“Trust Me, I Researched It Online”: Exploring the Bias in Search Engine Results (2022-US-30MP-1131)

Peter Hersh, JMP Senior Systems Engineer, SAS
Hadley Myers, Senior Systems Engineer, JMP

 

When collecting data for an analysis, we are all very cognizant of the need for an unbiased sample that is truly representative of a greater population. Great efforts, often at great expense, are made to ensure that this is the case. However, this standard is not always applied to other forms of data collection: for many, research into topics of interest starts and ends with online searches. Using a designed experiment and the visualization/analytic capabilities of JMP 17, we sought to investigate how different search engines in different parts of the world are potentially biasing search results and, therefore, the conclusions we respectively reach on these topics. Join us for this amusing and thought-provoking presentation that you should totally rate five stars prior to viewing to save time.

 

 

Thank you all for clicking on this talk and coming to watch it.

This is really about bias in data.

Every analyst who works on a project understands the importance of ensuring that the data they collect are unbiased.

Steps are taken to avoid bias at the start of data collection, before the project has even really begun; there are numerous checkpoints along the analysis; and at the end, any conclusion reached is taken in the context of potential additional sources of bias.

But this same level of care isn't applied to online searches on topics of interest.

Search engines use algorithms that are designed to deliver personalized content that is relevant for us as individuals.

Now, this has advantages. It means that returned search hits are more likely to be relevant and interesting, but it also has disadvantages: by definition, these results are not unbiased.

We have an example.

Yeah, Hadley brought up a great point there: in science and engineering, we take great care to make sure that our samples are unbiased.

But let's think of a library. Two people walk in, both interested in informing themselves on vaccine safety, and each asks a librarian for books on vaccines.

The first person receives three books.

These are actual books:

Smallpox: A Vaccine Success,

Anti-vaxxers: How to Challenge a Misinformed Movement,

and Stuck: How Vaccine Rumors Start and Why They Don't Go Away.

Now, let's say a different person walks in and receives three completely different books:

The COVID Vaccine: And Silencing of Our Doctors and Scientists,

Jabbed: How the Vaccine Industry, Medical Establishment, and Government Stick It to You and Your Family,

and Anyone Who Tells You Vaccines Are Safe and Effective Is Lying.

These are actual book titles.

Let's say that who you are, so where you live, how old you are, your gender, maybe even your browser history, determines which of these sets of books you get.

This is essentially the problem with bias as you go to search for things online.

It may be that before we even start looking at our search results, we've already got bias in there, and we want to understand whether that's the case or not.

That's what motivated this.

You got any thoughts on that, Hadley?

Well, the thought I'd like to express right now is that the purpose of this presentation isn't to judge or to opine on the advantages or disadvantages of the search algorithms that may or may not be used.

The purpose here really is just to take an example of complex, unstructured data (complex because it is unstructured), in this case search results, and then to use some of the exploratory visual and analytic capabilities found in JMP Pro 17 to try to understand what we were seeing, and to present it in such a way as to help you understand it too.

The real purpose of this presentation is to inspire you to try these techniques, and others like them, for yourselves on your own data.

Let's briefly go through the methodology.

What Pete and I did was come up with a few search terms we thought would lead to interesting results. You can see those terms here.

We defined some potential input variables which may or may not be affecting the results of the search.

We know that there are very likely others as well that we didn't include. This is true of any designed experiment: we can't capture every variable, but we took a few, and we'll see whether they are significant or not.

We developed a data collection procedure whereby we used the MSA Design in the DOE menu. This is a convenient way to create tables that we can then send to JMP SEs and friends of SEs.

Now, right away, this isn't an unbiased, random assortment of people we've asked to fill these out; they're all people who work for the same company and have the same job title.

As we said, the purpose is really to understand the techniques and methods we used to try to make sense of the data, and then to think about how you can apply them yourself.

We explored the results, which we'll show you. Then, finally, we presented the findings at JMP Discovery Summit Americas 2022, which is what you are watching right now.

Without further ado, let's jump into the data.

I'll start out by talking just briefly about the MSA design that you see here. What we've done is added the factors of interest and the search terms we were looking at.

The nice thing about this is that when we make the table, we can press this button, send the worksheets out to everybody who needed to complete the results for us, get them back, concatenate them, and then we're ready to begin our analysis.

But as any analyst who's ever collected data and tried to analyze it knows, the data very often isn't in a format where you can immediately start your analysis; some cleaning needs to be done.

I'll pass things over to Pete to talk about that. Pete?

Yes, great point. I think everybody has gone through this. Even with a well-designed DOE, you oftentimes have to make some adjustments to do the analysis.

Hadley showed those operator worksheets that came out, and here is one that I filled out. I'm not going to keep myself anonymous, but I didn't want to share someone else's results.

But just to give you an idea, we had folks answer a few demographic questions that hopefully weren't too revealing: basically, where you were located, how old you are, and then the search term you used.

Like Hadley showed, there were three responses. We had people do the search and record the top three responses that the search engine recommended.

To do the analysis, the first thing we had to do was take these three responses and bring them together.

A nice, easy way to do this is to go under Columns > Utilities and choose Combine Columns.

Now, I just called these "Responses" and added a little delimiter. I unchecked "Multiple response" because we're just going to do text analytics on this.

Then you get this column, which is the one we're using for Text Explorer.
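For those who prefer scripting, this step can also be done in JSL. The following is a minimal sketch under assumed column names (Response 1, Response 2, Response 3 are hypothetical stand-ins); adjust them to match your own worksheet.

// Combine the three response columns into one text column for Text Explorer.
// Column names here are hypothetical.
dt = Current Data Table();
dt << Combine Columns(
	delimiter( "," ),                                  // separator between the responses
	Columns( :Response 1, :Response 2, :Response 3 ),  // columns to merge
	Column Name( "Responses" )                         // name of the new combined column
);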

Now, I did this, and then I brought in all of the results from all the different people who took the survey and tweaked things a little bit more. For example, I combined whether you were in the US and which state you were in, and summarized that into a region, the Americas or Europe, because we didn't have enough respondents to break it up by state.

But in the end, we end up with a table that looks like this.

We had to do a little bit of recoding, a little bit of filling in, and then we anonymized the search engine. The folks that got the survey knew which search engine to use, but we're not sharing that here.

Hadley is now going to talk about some of the results we saw out of this once we had it in this form.

Let's open up this dashboard right here.

What you're seeing are the most popular terms, in order of popularity from left to right, in descending order, for the first response, second response, and third response, for every one of the search terms, for every gender and age, and all of the other factors.

We can use this hierarchical filtering on the dashboard to explore this a little closer and see if we can learn anything.

One thing I happened to notice: if we look at "the world is" and we click on male, you'll see that for many people, the first result they found was that the world is not enough; if you're female, you're equally likely to find that it's your oyster.

Interestingly, if you're less than 40 years old, that's when "the world is not enough" suddenly becomes "the world is yours."

I think we could probably agree that's true for people under 40, isn't it?

What else have we got here?

If I look at climate change, another hot topic of interest these days, as well it should be.

If I were to look at people over 50, apparently a huge concern for them is whether climate change is changing babies in the womb, which, interestingly, isn't a concern for people below 40.

I wondered whether this is a valid concern for people over 50, whether they're more likely to have their babies changed in the womb or not.

But aside from that, let's take a step back and see how we can go about creating this dashboard.

It's quite simple. The first thing we need to do is create our filter variables. I've done that here: here are our search terms and our distributions.
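As an aside, those filter distributions can be produced with a single line of JSL. This is a sketch under assumed column names; swap in whatever factors your table actually has.

// Launch Distribution on the filter variables (column names are hypothetical).
dt = Current Data Table();
dt << Distribution( Column( :Search Term, :Region, :Gender, :Age ) );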

What I'll do is go through how to create the Graph Builder report, because that's something you may not be familiar with and might be interested in doing.

I'm going to take my first response, put it here, and simply choose the number of times each result occurs. Then I can right-click and order by count, descending.

That's it. I've done the same thing for my second response and my third response as well.
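For reference, the same launch can be scripted. Here is a minimal JSL sketch, with "First Response" as an assumed column name; in the talk, the descending count ordering was applied by right-clicking in the report rather than in script.

// Bar chart of first-response counts in Graph Builder ("First Response" is hypothetical).
dt = Current Data Table();
dt << Graph Builder(
	Variables( X( :First Response ) ),
	Elements( Bar( X, Legend( 1 ) ) )
);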

Now we can go ahead and put together the pieces of the dashboard. We'll click on New Dashboard and choose the hierarchical filter plus one.

I'll take my distribution results and put them there, my input parameters, put them there, and then my graphs.

Let's see, is this one first? Well, I can't tell. We'll just put them in order like this, and I can always change the order if I want.

All right. I'll run the dashboard, and there we have it. It really is as simple as that. Then I can go ahead and save it to the table.

That was one use of a dashboard. I'll show you another use of a dashboard, which was to use it with a Text Explorer word cloud.

This shows the most common words, not just entire phrases or entries, but individual words. You can see the word "design" seems to be used a lot.

If I were to look at, for example, statistics, it looks like everybody can agree that statistics is a science. Interestingly, if you're in Europe, apparently you find it harder than you do if you're in America, where that word doesn't come up. Something I happened to notice there.

To create this dashboard, it's very much the same as the other one. We'll add our distribution items, here's the first one, here's the second one. We'll add our Text Explorer word cloud, and then we'll simply put this one together just as we did the previous one.

With that, I'd like to thank you for this part of the presentation about the exploratory visual analysis. I've shown you how you can go about doing this using the hierarchical dashboards.

Now I'll turn things back over to Pete, who will take us through some more in-depth use of the Text Explorer.

Perfect. Thanks, Hadley.

As Hadley mentioned, this is a different way to display this, but this is the end result of using the Text Explorer and looking just at the word cloud here.

He had made this a dashboard and used filters that were graphical in nature, which is great. You could also do this with a local data filter. But this is basically the end result we're going for.

Let's now back up and talk about how we got here.

With our data set over here, we just launched the Text Explorer under the Analyze menu and put in the column that we're interested in. In this case, that's all three responses combined into one column.

We have a bunch of options we can use to tweak this, including language and how we tokenize the words, but we're going to go ahead and just use the defaults.

Here you can see that since we have different responses to different search terms, the overall term and phrase list by itself is not super informative.

What we want to do is apply that local data filter, and the first thing we'll look at is the search term. Now we can choose something like the economy or coronavirus or climate change and go from there.
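If you want to reproduce this launch in JSL, here is a minimal sketch. The column names ("Responses" and "Search Term") are assumptions, and the interactive curation shown in the talk (phrases, stop words, sentiment terms) is done in the report after launch.

// Launch Text Explorer on the combined responses and attach a local data filter.
// Column names are hypothetical.
dt = Current Data Table();
te = dt << Text Explorer( Text Columns( :Responses ) );
te << Local Data Filter( Add Filter( columns( :Search Term ) ) );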

Let's focus in on climate change here.

One thing that I wanted to do was add some sentiment analysis. First, I'm going to go ahead and turn on this word cloud so it looks like it did before.

Now we can display it this way, where you have the most common terms in there, and you can see they're "climate" and "change."

We know that we're searching for that, so I could go in here and add these as stop words, and now see which words come up the most frequently when we're mentioning climate change.

This is one way to display the word cloud.

I can also go through here and maybe change this to something that is a little more appealing to the eye, but maybe less useful from a quantitative standpoint. You can always add some arbitrary colors if you like that as well.

All right, so I've gotten to this point, but now I want to add some sentiment analysis. Are people thinking climate change is natural, a good thing, or a bad thing?

You can see some things in here that maybe indicate that, but I wasn't quite sure where to find sentiment analysis.

With JMP 17, we have this new feature called Search JMP. If you're ever looking for an analysis in JMP, this is a great way to find it.

If I just start typing in "sentiment," you can see right here that it tells me how to find this. I can open the help, but I can also just hit Show Me, and it launches right there. If I'm ever wondering, hey, how can I do this, this gives me the option.

Now, a couple of things you see here: it's identified some of these default terms that are providing sentiment. Things like "good." If I click on "good," I get a little summary. It looks like when people are saying "good," that is actually a positive sentiment.

Now, what about "greatest"? Oh, boy, almost everything that says "greatest" is "greatest threat." Maybe that's not actually a positive sentiment there. We might need to do a little bit of tweaking.

First, let's go in here and say, okay, "greatest threat" is a phrase that we're seeing commonly, so I'm going to just add that phrase. Again, you would do this in your curation process, and now you see that that goes away.

But I think "greatest threat" is actually a negative thing. Let's look at those sentiment terms. You can see JMP has identified that as something that maybe has sentiment. I'm going to just say, you know what? That's a really negative sentiment.

Now when we go down here, you can see that it's flagged those seven occurrences where they mentioned "greatest threat" and scored them as very negative.

That's changed our overall impression of whether most of these search results view this topic as negative or positive.

That's just an example of how you can walk through that flow and arrive at the final sentiment analysis.

I'm going to pass it back over to Hadley and let him wrap things up here.

What I'd like to say is that we showed you, first of all, how we went about using the MSA Design to help with the data collection.

We used Recode and other items in the Tables menu to help with the data cleanup.

We then used Distribution, Graph Builder, Text Explorer, and combinations of all of them together to help with the data exploration, to see if we could uncover anything interesting.

Then Pete used Sentiment Analysis, together with Search JMP in JMP 17, to see what else we could learn about the data as a whole.

With that, I hope you found this useful, and I hope it's given you some ideas about how you can do this on your own data for yourselves.

I'd like to thank you all for listening, and I hope you enjoy the rest of the JMP Discovery Conference.

Thank you.