Blocking ChatGPT / OpenAI from Using Your Website Content

Security Lock Image

I’m an author who posts heavily on the web. I’ve been posting my content here on the internet since before 2000. It’s fair to say that I have hundreds of thousands of pages and images on the web. It’s fairly upsetting to think that an entity like ChatGPT can come along, absorb all of my content for FREE, and then spit it out for random scammers to use in whatever blog / book / email scams they want to perpetrate to scam money out of people.

I want to do my best to keep my content out of those hands.

Now, I hear the voices who say it’s hopeless. That any effort to keep your content under control is doomed to failure. But I’m also a person who lived through the Napster era. During Napster, people were stealing songs from musicians left, right, and center. It was thought, at the time, that musicians were wholly doomed to oblivion. Instead, Napster shut down, Spotify came up, and while things are certainly not “perfect” for musicians, they are certainly better. Sites like YouTube at least give links and credit to musicians. So I imagine the same type of chaos is going to have to happen for authors and artists. We will endure pain and desperation, but hopefully we will come out with some sort of a protective shield so we are at least recognized and acknowledged.

So an important step in that direction is for all of us to mark our content.

Here’s how to do that.

robots.txt

First, you need to have a file named robots.txt (it is case sensitive! lower case!) in the root directory of every website you manage. This robots.txt file tells all legitimate crawlers / robots how to handle your website. Yes, there are going to be miscreants out there who ignore it, but we are going to deal with the things we can manage here.
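To make the "root directory" point concrete, here is a minimal Python sketch that works out where a crawler would look for robots.txt, given any page on a site. The function name robots_url and the example.com address are my own placeholders for illustration, not anything official.

```python
from urllib.parse import urljoin

def robots_url(page_url: str) -> str:
    """Return the root-level robots.txt address for the site that page_url belongs to."""
    # A leading slash makes urljoin discard the page's own path and
    # point at the root of the same scheme and host.
    return urljoin(page_url, "/robots.txt")

# example.com is a placeholder domain used only for illustration.
print(robots_url("https://example.com/blog/2023/some-post.html"))
# prints https://example.com/robots.txt
```

No matter how deep the page is, the file is expected at the top level of the site, with its name in lower case.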

You want to add these two lines to your robots.txt:

User-agent: GPTBot
Disallow: /

Next, you also want to disallow the Common Crawl robot. This is another robot that gathers up data from websites and makes it available for random people to use. So also add in the two lines:

User-agent: CCBot
Disallow: /
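If you want to double-check that these rules say what you think they say, one option is Python's built-in urllib.robotparser module. The sketch below is only a rough check under my own assumptions: the helper name is_allowed, the example.com address, and the "SomeOtherBot" name are all made up for illustration. It also only tells you how a well-behaved parser reads the file, not what any particular robot will actually do.

```python
from urllib.robotparser import RobotFileParser

# The two rule groups from this article, as they would appear in robots.txt.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# example.com is a placeholder domain; SomeOtherBot is a made-up robot name.
PAGE = "https://example.com/any/page.html"
for bot in ("GPTBot", "CCBot", "SomeOtherBot"):
    print(bot, "allowed" if is_allowed(ROBOTS_TXT, bot, PAGE) else "blocked")
```

Under these rules, GPTBot and CCBot come back as blocked, while a robot that is not named in the file is still allowed.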

If you want to add comments for yourself in your robots.txt file, just start that comment with a # (pound) sign. So you can add a line like:

#GPTBot is what powers ChatGPT – the subsequent lines block ChatGPT

Use whatever comments you want in your robots.txt so you can keep track of what you are doing.
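If you manage a lot of sites, you may prefer to generate the commented rules rather than type them each time. This is a small Python sketch of that idea. The function name block_rules is hypothetical, and the comment wording is just my own note to myself; use whatever text you like.

```python
def block_rules(bots: dict[str, str]) -> str:
    """Build robots.txt text that blocks each bot, with a # comment above each group."""
    groups = []
    for user_agent, note in bots.items():
        # Each group is: a comment line, the robot's name, and a full-site disallow.
        groups.append(f"# {note}\nUser-agent: {user_agent}\nDisallow: /\n")
    # A blank line between groups keeps the file easy to read.
    return "\n".join(groups)

print(block_rules({
    "GPTBot": "GPTBot is what powers ChatGPT - the subsequent lines block it",
    "CCBot": "CCBot is the Common Crawl robot - the subsequent lines block it",
}))
```

You would paste the printed text into your robots.txt file, or write it out from the script if that suits your setup better.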

OK, so let’s say you add those two sets of commands to your robots.txt to prevent future inclusion by those two robotic programs. It’s not going to have any impact on the many other robots out there gathering up data. It’s also not going to retroactively remove you from already-built data sets. What else should you consider adding to your website if you’re not a fan of how ChatGPT uses content without permission?

Add a Footer Note about Your Site’s Stance on ChatGPT / AI

I’ve been using the term ChatGPT in this article because it’s currently the best known AI program out there, but there are of course many other variants of AI not called ChatGPT. If you’re not a fan of how AI programs are ingesting human-written content and then using it to generate non-attributed essays, make that clear on your site. Add a line to your header or footer which states that your site is HUMAN WRITTEN and is not available for use by AI programs. It’s not going to stop robots, but it will reassure human beings that you are a real person who is doing your best to be ethical.

I’ve done all sorts of searching and I can’t find any sort of a template for how you would want to phrase this on your website. I already copyright all my own websites with a statement that I personally write all of my own content. You legally don’t even have to do this – the moment you write something, you own it. In this world of AI theft, though, it’s worth it to emphasize even more that you are an actual human who wrote and owns the content you’re presenting. So I am going to go with:

All written content on this site is written personally by me, Lisa Shea, and copyright (c) to me, Lisa Shea. I strongly support the rights of authors and do not allow my content to be used or ingested by AI programs such as ChatGPT without attribution or recompense.

Nearly all images on this site are personally created or taken by me, Lisa Shea, and copyright (c) to me, Lisa Shea. There are times that I use fully licensed stock images, depending on the content of my articles. In those situations I will credit the stock company I acquired the image from. I do NOT use AI-generated images, unless the specific purpose of the essay is to discuss the ethical issues of AI-generated images, in which case I will clearly indicate that.

I may be adjusting these phrases over time, but at least for now I think that’s a good starting point. If you have any other suggestions or ideas, please let me know!

Authors, Artists, and AI

I am both an author and an artist. It is upsetting that AI programs such as ChatGPT can ingest all of our content WITHOUT PERMISSION and then regurgitate infinite versions of our content without any credit or compensation to the base author.

Let’s say someone enjoys the writing style of George R. R. Martin and also is a rabid homophobe. That person could use AI to ingest every piece written by George R. R. Martin, then have a ChatGPT program churn out wildly homophobic content in the style of George R. R. Martin to post everywhere under his name. In “normal times” that would be wholly illegal. In our modern times, it could make that scammer hundreds of thousands of dollars.

I strongly feel an author and artist should maintain creative control over the storylines they write, their characters, their worldbuilding, and definitely their name.

Hopefully, as with Napster, we are just currently in a “Wild West” stage and we will enter a new phase where authors and artists have a say over the creative worlds they create.

Let me know if you have any thoughts or comments on this.

A Note about Noindex

There used to be a command for robots.txt called noindex which asked search engines not to index your site. Google stopped using that command on September 1, 2019. So if you still have a noindex command in your robots.txt file, know that it’s probably now just being ignored.
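If you are not sure whether an old noindex command is still lurking in one of your files, a quick scan can find it. Here is a minimal Python sketch; the function name find_noindex_lines and the sample file contents are made up for illustration, and the check is a simple "does the line start with noindex" test rather than a full robots.txt parser.

```python
def find_noindex_lines(robots_txt: str) -> list[tuple[int, str]]:
    """Return (line_number, text) for each non-comment line that starts with noindex."""
    hits = []
    for number, line in enumerate(robots_txt.splitlines(), start=1):
        # Drop anything after a '#', since that is comment text.
        directive = line.split("#", 1)[0].strip()
        if directive.lower().startswith("noindex"):
            hits.append((number, directive))
    return hits

# A made-up robots.txt with one leftover noindex command.
SAMPLE = """\
User-agent: *
Noindex: /private/
Disallow: /tmp/
# noindex used to go here
"""
print(find_noindex_lines(SAMPLE))
# prints [(2, 'Noindex: /private/')]
```

Anything it reports is a line you can probably delete, since it is likely just being ignored now.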

Security lock image sourced from Pixabay
