Data to accompany the CSCW 2020 paper Measuring Misinformation in Video Search Platforms: An Audit Study on YouTube. https://dl.acm.org/doi/10.1145/3392854
This webpage describes the data accompanying the CSCW 2020 paper Measuring Misinformation in Video Search Platforms: An Audit Study on YouTube. The data was collected during two kinds of audit experiments: Search audits and Watch audits. The audits shed light on algorithmically surfaced misinformation on YouTube and on how it is affected by personalization attributes (gender, age, geolocation, and watch history). Search audits are conducted using brand-new user accounts, while Watch audits examine user accounts that have built a watch history by systematically watching either all promoting, all neutral, or all debunking videos on potentially misinformative topics. Our experiments collected 56,475 YouTube videos, spread across five popular misinformative topics and corresponding to three major components of YouTube: search results, Up-Next, and Top 5 recommendations. The audit data is spread across four files. Each file is described below, along with its download link.
1. Queries file
filename: queries.csv (download)
The file consists of the complete list of 49 search queries used in the audit study. It contains the following fields:
ID: unique ID assigned to the search query
Topic: name of the misinformative search topic (9/11 conspiracy theories, chemtrail conspiracy theory, flat earth, moon landing conspiracy theories or vaccine controversies)
Seed Query: a collection of keywords representing the search topic
Query: search query name
A snippet:
| ID | Topic | Seed Query | Query |
|---|---|---|---|
| 0 | 9/11 conspiracy theories | 11-Sep | 11-Sep |
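As a minimal sketch of how the queries file might be read, the Python below groups query strings by topic. It assumes queries.csv is comma-delimited with a header row whose column names match the fields listed above; the helper name `queries_by_topic` is hypothetical, and the in-memory sample contains only the snippet row.

```python
import csv
import io
from collections import defaultdict


def queries_by_topic(csv_file):
    """Group the Query column by the Topic column (hypothetical helper)."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["Topic"]].append(row["Query"])
    return dict(grouped)


# In-memory stand-in for queries.csv: the header follows the documented
# fields and the single data row is the snippet shown above.
sample = io.StringIO(
    "ID,Topic,Seed Query,Query\n"
    "0,9/11 conspiracy theories,11-Sep,11-Sep\n"
)
topics = queries_by_topic(sample)
```

With the real file, the same function could be called on `open("queries.csv", newline="", encoding="utf-8")`, provided the delimiter and header assumptions hold.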
2. Annotation Files
qid: unique ID assigned to the search query
vid_url: YouTube video URL
vid_title: title of the YouTube video
aria-label: aria-label of the YouTube video
annotation stance: stance of the video watched during Watch audits
normalized_annotation: 3-point normalized score with values -1 (Promoting), 0 (Neutral) and 1 (Debunking)
age_group: age group set while creating the Google account. This field can take the values 1 (<18 yrs), 2 (18-34 yrs), 3 (35-50 yrs) and 4 (>50 yrs)
gender: gender set while creating the Google account (male/female)
activity: audit experiment during which the video was collected (search/watch)
activity_type: personalization attribute audited (demographics/geolocation)
topic: name of the misinformative search topic
geolocation: geolocation where the experiment was performed
geo_temperature: type of geolocation (hot/cold)
component_name: YouTube component audited/collected (Top5, SearchResults or UpNext)
order: any record with an order o was gathered from the video page of the o-th watched video. Recall that in our Watch audits every account builds up its watch history by watching the 20 most popular videos
vid_order: order of the video in the Top 5 recommendation list
A snippet:
qid vid_url vid_title aria-label annotation stance normalized_annotation age_group gender activity activity_type topic geolocation geo_temperature component_name order vid_order
0 https://www.youtube.com/watch?v=9gCN7pIX3Es Rare video from ground zero on 9/11 Rare video from ground zero on 9/11 by 60 Minutes 5 months ago 3 minutes, 14 seconds 1,072,478 views 0 0 3 Female Search Demographics all us-east1-b (South Carolina) SearchResults
topic: misinformative search topic
aria-label: aria-label of the YouTube video
description: description of the YouTube video
vid_title: title of the YouTube video
vid_url: URL of the YouTube video
annotation: same as (2)
normalized_annotation: same as (2)
duration: duration of the YouTube video
viewCount: view count of the YouTube video
likeCount: like count of the YouTube video
dislikeCount: dislike count of the YouTube video
favoriteCount: favorite count of the YouTube video
commentCount: comment count of the YouTube video
popularity: popularity metric value of the YouTube video (see section 3 for more details)
A snippet:
topic aria-label description vid_title vid_url annotation notes normalized_annotation duration viewCount likeCount dislikeCount favoriteCount commentCount popularity
911 Haunting, never-before-seen images of Ground Zero by CBS 7 years ago 3 minutes, 57 seconds 2,385,714 views A few days after 9/11, FEMA sent its own cameras down into the ruins of the World Trade Center, filming for over 8 months and getting images no one else was able to get. CBS News justice and homeland security correspondent Bob Orr reports. Haunting, never-before-seen images of Ground Zero https://www.youtube.com/watch?v=coqYraFn-B4 0 0 237 2396779 8366 718 0 2649 2408512
3. Popularity Metric Files
foldername: popularity_metric.zip (download)
In our Watch audit experiments, we build the watch history of YouTube accounts by automatically making them watch videos that are either all debunking, all neutral, or all promoting with respect to the misinformative topic under audit. We select the 20 most popular videos for each of the misinformative topics. We define the popularity of a video as the engagement accumulated by that video at the time of our experimental runs. It is calculated as:
popularity metric = like count + dislike count + view count + comment count + favorite count
The folder consists of 15 files (5 misinformative topics × 3 stances). Each file lists the video URLs that were used to build the YouTube accounts' watch history, along with video metadata (duration, view count, like count, dislike count, favorite count and comment count) and the popularity metric value for every video.
A snippet:
Id qid topic query vid_url vid_title aria-label Stance duration viewCount likeCount dislikeCount favoriteCount commentCount popularity
17 0 9/11 conspiracy theories 11-Sep https://www.youtube.com/watch?v=MNyjZJOEXpE How the 9/11 terror attacks unfolded | Telegraph Time Tunnel How the 9/11 terror attacks unfolded | Telegraph Time Tunnel by The Telegraph 2 years ago 2 minutes, 8 seconds 3,589,933 views -1 128 3987854 8264 2786 0 1 3998905
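The popularity formula can be checked against the snippet row with a short Python sketch. The column names are taken from the file header above; the function names `popularity` and `most_popular` are hypothetical, and the ranking helper only illustrates picking the k highest-scoring videos, not the exact selection procedure used in the study.

```python
def popularity(row):
    """Sum the five engagement counts that make up the popularity metric."""
    return (
        row["likeCount"]
        + row["dislikeCount"]
        + row["viewCount"]
        + row["commentCount"]
        + row["favoriteCount"]
    )


def most_popular(rows, k=20):
    """Return the k rows with the highest popularity (illustrative helper)."""
    return sorted(rows, key=popularity, reverse=True)[:k]


# Engagement counts copied from the snippet row above.
snippet_row = {
    "viewCount": 3987854,
    "likeCount": 8264,
    "dislikeCount": 2786,
    "favoriteCount": 0,
    "commentCount": 1,
}
computed = popularity(snippet_row)
print(computed)  # 3998905, matching the popularity column of the snippet
```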
4. SERP-MS scores
filename: all_Top10_SERP-MM.csv (download)
The file contains the SERP-MS (SERP Misinformation Score) values of the search engine results pages retrieved during the audit experiments. SERP-MS is a scoring metric that captures the amount of misinformation while taking into account the ranking of search results. It is calculated as:
$$\text{SERP-MS} = \frac{\sum_{r=1}^{n} x_r \, (n - r + 1)}{n (n + 1) / 2}$$
where r is the rank of the search result, n is the number of search results present in the SERP and x_r is the annotation value of the result at rank r (-1: promoting, 0: neutral or 1: debunking). We only consider the top 10 search results when computing SERP-MS. Thus, SERP-MS is a continuous value ranging from -1 (all top 10 videos are debunking) to +1 (all top 10 videos are promoting).
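A rank-weighted score of this kind can be sketched in Python as follows. The sketch assumes that the result at rank r receives weight n - r + 1 and that the weighted sum is divided by n(n + 1)/2 so the score stays within [-1, 1]; this weighting is an assumption here, and the paper remains the authoritative definition. The function name `serp_ms` and the `top_k` parameter are hypothetical.

```python
def serp_ms(annotations, top_k=10):
    """Rank-weighted mean of annotation values for the top_k results.

    `annotations` lists the values (-1, 0 or 1) in rank order, best rank first.
    Assumed weighting: rank r gets weight n - r + 1, normalized by n(n + 1)/2.
    """
    values = list(annotations)[:top_k]
    n = len(values)
    if n == 0:
        raise ValueError("at least one annotated search result is required")
    weighted_sum = sum(x * (n - r + 1) for r, x in enumerate(values, start=1))
    return weighted_sum / (n * (n + 1) / 2)


# The same single non-zero annotation counts more at rank 1 than at rank 10.
top_heavy = serp_ms([1, 0, 0, 0, 0, 0, 0, 0, 0, 0])
bottom_heavy = serp_ms([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
```

Under this weighting, ten identical annotations give exactly that annotation value as the score, and a non-zero annotation contributes more the higher it is ranked.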
The file contains the following fields:
qid: unique ID assigned to the search query
query: search query name
query_stance: stance assigned to the search query. It can take 3 values, namely -1 (Promoting), 0 (Neutral) and 1 (Debunking)
topic: name of the misinformative search topic
age_group: age group set while creating the Google account
gender: gender set while creating the Google account (male/female)
activity: audit experiment during which the video was collected (search/watch)
activity_type: personalization attribute audited (demographics/geolocation)
stance: stance assigned to the video by the annotators. The annotations are based on a 9-point annotation scale ranging from -1 to 7
geolocation: geolocation where the experiment was performed
geo_temperature: type of geolocation (hot/cold)
normalized_smm: SERP-MS score of the SERP
A snippet:
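| qid | query | query_stance | topic | age_group | gender | activity | activity_type | stance | geolocation | geo_temperature | normalized_smm |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 11-Sep | 0 | 911 | 2 | Female | Watch | Geolocation | neutral | Georgia | cold | 0 |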
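One way to summarize this file is to average the normalized_smm column within groups of an attribute such as geo_temperature or gender. The Python sketch below assumes a comma-delimited file with a header row matching the fields above; the helper name `mean_score_by` is hypothetical, and the in-memory sample contains only the snippet row.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_score_by(csv_file, key, score_field="normalized_smm"):
    """Average the score column within each distinct value of `key`."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row[key]].append(float(row[score_field]))
    return {group: mean(scores) for group, scores in groups.items()}


# In-memory stand-in for the scores file, holding only the snippet row.
sample = io.StringIO(
    "qid,query,query_stance,topic,age_group,gender,activity,activity_type,"
    "stance,geolocation,geo_temperature,normalized_smm\n"
    "0,11-Sep,0,911,2,Female,Watch,Geolocation,neutral,Georgia,cold,0\n"
)
by_temperature = mean_score_by(sample, "geo_temperature")
```

Passing a different column name as `key`, for example "gender" or "stance", would give the corresponding per-group averages.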
If you use the dataset in your research, please cite the following paper:
Eslam Hussein, Prerna Juneja, and Tanushree Mitra. 2020. Measuring Misinformation in Video Search Platforms: An Audit Study on YouTube. Proc. ACM Hum.-Comput. Interact. 4, CSCW1, Article 048 (May 2020), 27 pages. DOI:https://doi.org/10.1145/3392854.