Automattic Faces Scrutiny Over AI Access Policy

This article is a joint effort by James Giroux & Jyolsna.

After unconfirmed reports of Google entering into a content licensing agreement with Reddit for training its AI, 404 Media claimed yesterday that Automattic is set to sell Tumblr and WordPress.com users’ content to Midjourney and OpenAI. If true, this could mirror an extended partnership that Shutterstock entered into with OpenAI last year.

Claims of 404 Media

404 Media claims to have insider information about the deal, backed up with documentation, confirming that Automattic is in the advanced stages of negotiation with these AI companies. To support its claims, 404 Media quoted Tumblr Product Manager Cyle Gage, who reported on an internal message board on the status of the initial data collection process and how it included content that should not have been collected.

While 404 Media has provided quotes from an internal source, it has not provided any specific proof, such as screenshots of conversations or access to source materials, that would help others validate its claims. 404 Media also refers to user content as “users’ data,” which can easily be misconstrued as personally identifiable information (PII) or credit card information, whereas the content discussed in the article is already publicly available.

Response From Automattic

Within a few hours of 404 Media’s article going up, Automattic released a statement describing its position on content distribution and the rights of all users on WordPress.com and Tumblr to opt out of their public content being included in data shared with AI partners.

Automattic argues that AI regulation and legislation do not yet exist and that, as such, it is taking these steps to proactively provide users with additional methods of controlling how and where their content is made available. It is creating a pathway for AI partners to get streamlined access to the content users are open to sharing while also taking steps to remove access to content that users no longer want to be shared. In other words, the content in question is already available to the AI companies because it is publicly crawlable; content deals only make it more accessible and manageable.

Automattic published “Protecting User Choice,” emphasizing the following points:

We currently block, by default, major AI platform crawlers—including ones from the biggest tech companies—and update our lists as new ones launch.
We have a setting to discourage search engines from indexing a site on WordPress.com and Tumblr. This signals to search engines not to crawl that content or include it in search results.
We have added similar settings to WordPress.com and Tumblr to discourage crawling by AI companies. If you already discourage search engine indexing, this is automatically enabled.
We will share only public content that’s hosted on WordPress.com and Tumblr from sites that haven’t opted out.

The statement goes on to hint at a future deal: “We are also working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control. Our partnerships will respect all opt-out settings. We also plan to take that a step further and regularly update any partners about people who newly opt out and ask that their content be removed from past sources and future training.”

Automattic also released a new tool that “lets you opt out of sharing content from your public blogs with third parties, including AI platforms that use such content for training models. We will engage with AI companies that we can have productive relationships with, and are working to give you an easy way to control access to your content…We already discourage AI crawlers from gathering content from WordPress.com and will continue to do so, save for those with which we partner… We are committed to making sure our partners respect those decisions.”

WordPress.org Users Aren’t Affected

Josepha Haden Chomphosy, Executive Director of WordPress, shared this with the community in the Slack channel: “I can confirm that the WordPress project is not involved in selling user data or content for AI training purposes. This has been our consistent stance across the long history of WordPress, even as recently as when I was sharing thoughts for the future of our project heading into 2023.”

Later, Jetpack tweeted that “data from Jetpack connected sites is not included. This only applies to WordPress.com hosted sites.”

Interestingly, Automattic has been struggling to make Tumblr profitable since acquiring it in 2019. Last year, Matt Mullenweg revealed that Tumblr is losing $30M each year.

We have reached out to Chenda Ngak (Head of Communications at Automattic) and will update this article once we receive her response.

(WordPress (or WordPress.org) is an open-source CMS, while WordPress.com is a hosted platform owned by Automattic, a company founded by Matt Mullenweg. The two are not the same.)
