What content did Pornhub remove?

An analysis of what's missing from the platform.

By Andrew Thompson

In 2019, Components released the first analysis I’m aware of on the ascent and dominance of step-fantasy porn on Pornhub, based on metadata from a sample of 218,000 videos spanning the years 2008 to 2018. This analysis later served as the basis of Mashable’s treatment of the same topic. At the end of 2020, New York Times columnist Nicholas Kristof published a pitch-black article on Pornhub owner MindGeek’s repeated handwaving when dealing with complaints of illicit content on its various platforms, including underage and non-consensual videos. In response to the article, Visa and MasterCard ceased processing payments for all of MindGeek’s properties, confronting the company with an existential crisis.

MindGeek responded with a sweeping takedown of all unverified (i.e., no blue check) content. Over the course of two days, its library of 14 million videos was reduced to 4.7 million. Some performers complained that, in its survivalist throes, MindGeek overreacted and purged content that was fully compliant with the site’s terms and conditions. As one said, “When I email Pornhub about this, they put my videos back up, but they’re eventually flagged and disabled again, and the cycle repeats.”

Nearly two years later, it's unclear whether MindGeek has reached something approaching equilibrium in its content moderation (its CEO and COO resigned three months ago, so the company as a whole isn't quite homeostatic), but the working assumption of this analysis is that things are less frenetic than they were in the immediate aftermath of the exposé. We can therefore cross-reference the same dataset used for the original report, see what's missing, and examine the characteristics of banned content on Pornhub that goes beyond the site's mere overcorrection.

To determine whether a video was disabled, a script made a request to every URL in the original 218,000-video dataset (a sketch of this check follows the list below). Disabled videos on Pornhub carry different notices. Across the entire dataset, the following notices were returned, with the number of occurrences in parentheses:

  • “Page not found [without additional notice]” (81,639)
  • “This video has been disabled” (24,204)
  • “This video was deleted.” (6,636)
  • “Video has been removed at the request of the copyright holder” (6,202)
  • “Video has been removed due to a violation of the Terms and Conditions” (3,799)
  • “Video has failed encoding” (2)

The analysis here makes no distinction between these notices; the assumption is that in the harried days following the Times exposé, Pornhub's threadbare content moderation team was less than meticulous in applying specific notices to specific videos, simply trying to remove as much potentially violating content as possible. The one technical notice is so infrequent as to be negligible.

In total, 56% of the 218,000-video sample is now gone, a noticeably smaller share than the roughly two-thirds of the library wiped out in the days following the Times article. The gap could be a sampling error, or it could reflect that what we're analyzing is not the result of the blind shotgun approach of late 2020, but content that the site's moderators have genuinely deemed violating under their new, more functional moderation regime.

The following table describes some comparative characteristics, both for all years 2008 to 2018 and for 2018 alone. The latter is perhaps a more apples-to-apples comparison, since disabled videos skew toward earlier years, and the all-years figures don't account for Pornhub's greater popularity and algorithmic changes in later years:

                                      Enabled    Disabled    D-to-E ratio
Median views (all years)              158,015    271,763     1.72
Median upvotes (all years)            325        487         1.49
Upvote-to-downvote ratio (all years)  3.18       3.72        -
Median views (2018)                   110,451    275,216     2.49
Median upvotes (2018)                 316        885         2.80
Upvote-to-downvote ratio (2018)       2.87       3.43        -
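
As a rough sketch of how these figures could be derived, assuming a dataframe with hypothetical views, upvotes, downvotes, year, and status columns (the toy values below are invented purely for illustration):

```python
import pandas as pd

# Toy stand-in for the 218,000-video sample; values invented for illustration.
df = pd.DataFrame({
    "year":      [2018, 2018, 2012, 2012],
    "status":    ["enabled", "disabled", "enabled", "disabled"],
    "views":     [100_000, 250_000, 150_000, 270_000],
    "upvotes":   [300, 900, 320, 490],
    "downvotes": [100, 260, 100, 130],
})

for label, subset in [("all years", df), ("2018", df[df["year"] == 2018])]:
    grouped = subset.groupby("status")
    print(label)
    print(grouped["views"].median())
    print(grouped["upvotes"].median())
    # Aggregate ratio; a median of per-video ratios is another plausible reading.
    print(grouped["upvotes"].sum() / grouped["downvotes"].sum())
```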

In other words, videos published in 2018 that are now disabled were viewed about two-and-a-half times as much as videos that remain live, and they were upvoted 2.8 times as much. They were also downvoted proportionally less: the upvote-to-downvote ratio is higher for disabled videos in both the all-years and 2018 groups.

We can also look at the nature of banned content on two different levels: videos' categories, and their text. This graph shows, for each category tag, the proportion of videos that are now disabled (for example, 85.67% of videos tagged "Celebrity" were subsequently disabled):

Only categories that appeared at least 100 times are included. While I'm tempted to shower commentary on each of these results and imbue them with meaning (which I do think they possess), I won't, except to point out the perhaps statistically obvious fact that the further above the yellow line a category sits, the more strongly it is associated with removed content, and the further below, the less so. I'll also clarify a point of confusion among many who saw this graph pre-publication: there was in fact a lot of music on Pornhub, both in the form of "porn music videos" (whose abbreviation we'll see again shortly) and even just regular music videos.
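
A sketch of the per-category proportions, again with hypothetical column names and toy data standing in for the real sample:

```python
import pandas as pd

# Toy stand-in; in the real dataset each video carries a list of category tags.
df = pd.DataFrame({
    "categories": [["Celebrity", "Music"], ["Music"], ["Amateur"]],
    "status":     ["disabled", "disabled", "enabled"],
})

# One row per (video, category) pair, then the share disabled per category.
exploded = df.explode("categories")
per_category = exploded.groupby("categories")["status"].agg(
    total="count",
    pct_disabled=lambda s: (s == "disabled").mean() * 100,
)
# The report keeps only categories appearing at least 100 times.
print(per_category.sort_values("pct_disabled", ascending=False))
```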

At a far more granular level, we can look at the text of the videos' titles by lemmatizing the titles' words. In lay terms, lemmatization reduces words to their base dictionary forms. For example, if we lemmatized the tokens of the sentence "i am showing you how tokenization works", we would get the tokens [i, be, show, -PRON-, how, tokenization, work], allowing us to remove the noise of different tenses, plurals, and so on. (Note that the lowercased 'i' does not lemmatize to '-PRON-', which is significant in the following results.)
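
A minimal sketch of that step, assuming spaCy v2.x with the large English model (spaCy v3 changed pronoun lemmas, so it would return "you" rather than "-PRON-"):

```python
import spacy

# Assumes a spaCy v2.x model, which lemmatizes (most) pronouns to the
# placeholder "-PRON-"; spaCy v3 returns the pronoun itself instead.
nlp = spacy.load("en_core_web_lg")

doc = nlp("i am showing you how tokenization works")
print([token.lemma_ for token in doc])
# ['i', 'be', 'show', '-PRON-', 'how', 'tokenization', 'work']
```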

Rather than a graph, here's a sheet of all the tokens that occurred at least 98 times, with each token's number of occurrences and the percentage of videos containing that token that were disabled. All of the titles were lowercased and stripped of numbers and punctuation beforehand to reduce redundancy:
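
A rough sketch of the tally behind that sheet, with toy data standing in for the real sample and a plain whitespace split standing in for the spaCy lemmatization described above:

```python
import re
from collections import Counter

# Toy stand-in for the dataset: (title, disabled?) pairs.
videos = [
    ("Step Sis Shows How It Works 2", True),
    ("Porn Music Video 3", True),
    ("Showing You Around", False),
]
MIN_OCCURRENCES = 1  # 98 in the published sheet

occurrences, disabled = Counter(), Counter()
for title, is_disabled in videos:
    # Lowercase and strip numbers/punctuation, per the preprocessing above.
    cleaned = re.sub(r"[^a-z\s]", "", title.lower())
    for token in set(cleaned.split()):  # count each token once per title
        occurrences[token] += 1
        if is_disabled:
            disabled[token] += 1

rows = sorted(
    ((tok, n, 100 * disabled[tok] / n)
     for tok, n in occurrences.items() if n >= MIN_OCCURRENCES),
    key=lambda row: row[2],
    reverse=True,
)
for tok, n, pct in rows:
    print(f"{tok:<12}{n:>5}{pct:8.1f}%")
```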

Many of these results, particularly at the very top, speak for themselves, while others do not. The list is full of abbreviations, like "pmv" ("porn music video"), "amwf" ("asian male white female"), and so on. I've left in individual letters, despite the difficulty of identifying their significance detached from the data. For example, "y" occurs both in Spanish-language titles and in titles with things like "18 y.o.", while "i" seems pretty universally to stand for the first-person pronoun.

Parsing the meaning (cultural, psychoanalytic, journalistic, algorithmic, etc.) of these results is a project in its own right. Hopefully someone can perform the interpretive due diligence on what's here. To that end, the revised Pornhub dataset, with columns indicating each video's status, is now also available to use.

Notes

Tokenization was performed with spaCy's large English language model. The process of tokenization and filtering mostly followed that laid out in the KNN tutorial.