YouTubeAudit-data

Data to accompany the CSCW paper Measuring Misinformation in Video Search Platforms: An Audit Study on YouTube.


Data Description

The YouTube audit data was collected during two kinds of audit experiments: Search audits and Watch audits. The experiments shed light on algorithmically surfaced misinformation on YouTube and how it is affected by personalization attributes (gender, age, geolocation, and watch history). Search audits are conducted using brand-new user accounts, whereas Watch audits examine user accounts that have built a watch history by systematically watching only promoting, only neutral, or only debunking videos on potentially misinformative topics. Our experiments collected 56,475 YouTube videos, spread across five popular misinformative topics and corresponding to three major components of YouTube: videos present in search results, Up-Next, and Top 5 recommendations. The audit data is spread across four files. A description of each file, along with its download link, is given below.

1. Queries file

filename: queries.csv (download) The file consists of the complete list of 49 search queries used in the audit study. It contains the following fields: ID, Topic, Seed Query, and Query.

A snippet:

ID	Topic	Seed Query	Query
0	9/11 conspiracy theories	11-Sep	11-Sep

2. Annotation Files

A snippet:

qid	vid_url	vid_title	aria-label	annotation	stance	normalized_annotation	age_group	gender	activity	activity_type	topic	geolocation	geo_temperature	component_name	order	vid_order                                                         

0	https://www.youtube.com/watch?v=9gCN7pIX3Es	Rare video from ground zero on 9/11	Rare video from ground zero on 9/11 by 60 Minutes 5 months ago 3 minutes, 14 seconds 1,072,478 views	0		0	3	Female	Search	Demographics	all	us-east1-b (South Carolina)		SearchResults

A snippet:

topic	aria-label	description	vid_title	vid_url	annotation	notes	normalized_annotation	duration	viewCount	likeCount	dislikeCount	favoriteCount	commentCount	popularity

911	Haunting, never-before-seen images of Ground Zero by CBS 7 years ago 3 minutes, 57 seconds 2,385,714 views	A few days after 9/11, FEMA sent its own cameras down into the ruins of the World Trade Center, filming for over 8 months and getting images no one else was able to get. CBS News justice and homeland security correspondent Bob Orr reports.	Haunting, never-before-seen images of Ground Zero	https://www.youtube.com/watch?v=coqYraFn-B4	0		0	237	2396779	8366	718	0	2649	2408512
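In the snippet rows, the aria-label field embeds the video duration and a view count as free text. The following is a minimal sketch of how those two values might be extracted, assuming every label follows the shape "<title> by <channel> <age> ago <duration> <N> views" seen in the snippets. The names `parse_aria_label`, `LABEL_RE`, and `UNIT_SECONDS` are hypothetical helpers, not part of the dataset or its tooling.

```python
import re

# Seconds per unit for the duration phrase that precedes the view count.
UNIT_SECONDS = {"hour": 3600, "minute": 60, "second": 1}

# Assumed label shape: "<title> by <channel> <age> ago <duration> <N> views".
# The greedy ".*" anchors on the last " ago " in the label.
LABEL_RE = re.compile(r".* ago (?P<duration>.+) (?P<views>[\d,]+) views?$")


def parse_aria_label(label):
    """Return (duration_in_seconds, view_count), or None if the label does not fit."""
    match = LABEL_RE.match(label)
    if match is None:
        return None
    seconds = sum(
        int(amount) * UNIT_SECONDS[unit]
        for amount, unit in re.findall(
            r"(\d+) (hour|minute|second)s?", match["duration"]
        )
    )
    views = int(match["views"].replace(",", ""))
    return seconds, views


label = (
    "Haunting, never-before-seen images of Ground Zero by CBS "
    "7 years ago 3 minutes, 57 seconds 2,385,714 views"
)
print(parse_aria_label(label))  # (237, 2385714)
```

For this row, the parsed duration agrees with the duration column (237). The view count in the label (2,385,714) differs from the viewCount column (2396779), so the two fields should not be treated as interchangeable.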

3. Popularity Metric Files

foldername: popularity_metric.zip (download) In our Watch audit experiments, we build the watch history of YouTube accounts by automatically making them watch videos that are all debunking, all neutral, or all promoting with respect to the misinformative topic under investigation. We select the 20 most popular videos for each of the misinformative topics. We define the popularity of a video as the engagement accumulated by that video at the time of our experimental runs. It is calculated as:

popularity metric = like count + dislike count + view count + comment count + favorite count

The folder consists of 15 files (5 misinformative topics × 3 stances). Each file contains the list of video URLs that were used to build the YouTube accounts' watch history, along with video metadata (duration, view count, like count, dislike count, favorite count, and comment count) and the popularity metric value for every video.

A snippet:

Id	qid	topic	query	vid_url	vid_title	aria-label	Stance	duration	viewCount	likeCount	dislikeCount	favoriteCount	commentCount	popularity
17	0	9/11 conspiracy theories	11-Sep	https://www.youtube.com/watch?v=MNyjZJOEXpE	How the 9/11 terror attacks unfolded | Telegraph Time Tunnel	How the 9/11 terror attacks unfolded | Telegraph Time Tunnel by The Telegraph 2 years ago 2 minutes, 8 seconds 3,589,933 views	-1	128	3987854	8264	2786	0	1	3998905
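As a quick check of the popularity formula, the sketch below recomputes the metric from the count columns of the snippet row and compares it with the stored popularity value. The helper names `popularity` and `top_n` are hypothetical, and the rows are assumed to be dictionaries keyed by the column names shown in the snippet header (for example, as produced by Python's `csv.DictReader`).

```python
# Count columns that enter the popularity metric, as named in the snippet header.
COUNT_FIELDS = (
    "likeCount",
    "dislikeCount",
    "viewCount",
    "commentCount",
    "favoriteCount",
)


def popularity(row):
    """Sum the engagement counts of one video row (values may be strings)."""
    return sum(int(row[field]) for field in COUNT_FIELDS)


def top_n(rows, n=20):
    """Return the n rows with the highest recomputed popularity."""
    return sorted(rows, key=popularity, reverse=True)[:n]


# Count values copied from the snippet row.
snippet_row = {
    "viewCount": "3987854",
    "likeCount": "8264",
    "dislikeCount": "2786",
    "favoriteCount": "0",
    "commentCount": "1",
    "popularity": "3998905",
}

print(popularity(snippet_row))  # 3998905
print(popularity(snippet_row) == int(snippet_row["popularity"]))  # True
```

`top_n` only illustrates how a "20 most popular" selection could be expressed. The actual selection procedure used in the study may have involved additional filtering.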

4. SERP-MS scores

filename: all_Top10_SERP-MM.csv (download) The file contains the SERP-MS (SERP Misinformation Score) values of the search engine results pages (SERPs) retrieved during the audit experiments. SERP-MS is a scoring metric that captures the amount of misinformation in a SERP while taking into account the ranking of the search results. It can be calculated as:

where r is the rank of the search result, n is the number of search results present in the SERP, and x is the annotation value (-1: debunking, 0: neutral, or 1: promoting). We only consider the top 10 search results when computing SERP-MS. Thus, SERP-MS is a continuous value ranging from -1 (all top 10 videos are debunking) to +1 (all top 10 videos are promoting).
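The sketch below shows one way to compute a score with these properties. It rests on two assumptions that should be checked against the paper. First, each result's annotation x is weighted linearly by rank as (n - r + 1) and the sum is normalized by n(n + 1)/2. Second, the annotation values are -1 for debunking, 0 for neutral, and +1 for promoting, which is what the stated range of -1 (all debunking) to +1 (all promoting) requires. The function name `serp_ms` is hypothetical.

```python
def serp_ms(annotations, top_k=10):
    """Rank-weighted misinformation score of one SERP.

    annotations: annotation values in rank order (rank 1 first), each being
    -1 (debunking), 0 (neutral), or +1 (promoting). Only the first top_k
    results are used.
    """
    xs = list(annotations)[:top_k]
    n = len(xs)
    if n == 0:
        raise ValueError("at least one annotated search result is required")
    # Assumed weighting: the result at rank r contributes x * (n - r + 1).
    weighted_sum = sum(x * (n - r + 1) for r, x in enumerate(xs, start=1))
    # Normalizing by n(n + 1)/2, the sum of all weights, bounds the score to [-1, 1].
    return weighted_sum / (n * (n + 1) / 2)


print(serp_ms([1] * 10))   # 1.0  (all promoting)
print(serp_ms([-1] * 10))  # -1.0 (all debunking)
# One promoting video counts for more at rank 1 than at rank 10.
print(serp_ms([1] + [0] * 9) > serp_ms([0] * 9 + [1]))  # True
```

Under this weighting, a single promoting video at rank 1 contributes 10/55 to the score, while the same video at rank 10 contributes only 1/55.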

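The file contains the following fields: qid, query, query_stance, topic, age_group, gender, activity, activity_type, stance, geolocation, geo_temperature, and normalized_smm.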

A snippet:

qid	query	query_stance	topic	age_group	gender	activity	activity_type	stance	geolocation	geo_temperature	normalized_smm
0	11-Sep	0	911	2	Female	Watch	Geolocation	neutral	Georgia	cold	0

Citation Information

If you use the dataset in your research, please cite the following paper:

Eslam Hussein, Prerna Juneja, Tanushree Mitra. “Measuring Misinformation in Video Search Platforms: An Audit Study on YouTube”. The 23rd ACM Conference on Computer-Supported Cooperative Work and Social Computing. 2020.