WSJ and CNN Unhappy With Articles Used to Train ChatGPT

engine

5:26 pm on Feb 20, 2023 (gmt 0)

News outlets such as the Wall Street Journal and CNN are criticising OpenAI's use of their articles to train ChatGPT.

It had to happen at some point: The golden goose won't be quite so golden if it has to licence and/or pay. I guess as individual webmasters we don't have the "clout" of a major publication.

[bloomberg.com...]

Sgt_Kickaxe

5:47 pm on Feb 20, 2023 (gmt 0)

A lawsuit was always going to happen, it does with any sufficiently new technology. Existing laws need amending, they don't cover this type of content use.

Though a lawsuit is unlikely to succeed, because there is no law covering this yet, it will prompt a review, and either amendments to existing laws or entirely new ones.

You can't follow the rules when they don't exist yet.

Brett_Tabke

6:22 pm on Feb 20, 2023 (gmt 0)

What about services like Moz or SemRush that use public website data to build their service? Are they next?

Dimitri

6:25 pm on Feb 20, 2023 (gmt 0)

they don't cover this type of content use.

I don't see how current laws are not covering this usage.

Laws say you can't use someone else's copyrighted work without authorization/license/etc... Training an AI with copyrighted material is perfectly covered; no need to change laws.

Also, rewording and mixing existing material is subject to plagiarism laws too.

graeme_p

7:00 pm on Feb 20, 2023 (gmt 0)

Laws say you can't use someone else's copyrighted work without authorization/license/etc

They say you cannot make copies of a copyright work, apart from certain exceptions.

Training an AI with copyrighted material is perfectly covered; no need to change laws.

Training an AI with copyright material seems to be allowed. What law forbids it? It may be a breach of copyright to take copies to create a training dataset, but then the same would be true of search indices and caches of web sites.

rewording and mixing existing material is subject to plagiarism laws too.

What plagiarism laws in particular?

Dimitri

7:16 pm on Feb 20, 2023 (gmt 0)

Under the Copyright Act, a compilation is a "work formed by the collection and assembling of preexisting materials or of data that are selected, coordinated, or arranged in such a way that the resulting work as a whole constitutes an original work of authorship. The term compilation includes collective works" 17 U.S.C. 101. This gives the compilation a separate copyright from any of the individual pieces within it. An author who creates a compilation owns the copyright of the compilation but not of the component parts. The author can compile material even if someone else owns the copyright, but the author must get the rights holders' permission to do so.
[law.cornell.edu...]

To me the training of an AI is effectively "a collection and assembling of preexisting materials or of data that are selected, coordinated, or arranged". And therefore it requires authorization from the rights holders.

Plagiarism can warrant legal action if it infringes upon the original author's copyright, patent, or trademark. Plagiarism can also result in a lawsuit if it breaches a contract with terms that only original work is acceptable.

To avoid plagiarism, a person should always properly attribute any information they use to the original author through quotes or citations.
[law.cornell.edu...]

ChatGPT is doing none of this.

but then the same would be true of search indices and caches of web sites.

By not blocking a search engine crawler, you grant the search engine the right to use your data to build its index. The same goes for website caches. If someone is caching your site against your will, you have the right to file a lawsuit.

graeme_p

10:37 pm on Feb 20, 2023 (gmt 0)

To me the training of an AI is effectively "a collection and assembling of preexisting materials or of data that are selected, coordinated, or arranged"

To you, maybe. Can you provide a precedent for any judge sharing that view? As we are talking about US law here, ideally that would be a majority of the justices of the Supreme Court of the US.

One could (and people do) argue that there is no such thing as an original creative work, because all works are rearrangements of elements from previous works. Applying that view to copyright law would cause significant practical difficulties.

From your link:

Plagiarism is not illegal in the United States in most situations. Instead it is considered a violation of honor or ethics codes

ChatGPT is dishonourable!

By not blocking a search engine crawler, you grant the search engine the right to use your data to build its index.

So by not blocking the crawlers used by ChatGPT's systems to assemble training data, you gave them permission.

If someone is caching your site against your will, you have the right to file a lawsuit.

A lot of "it depends" there.

Dimitri

11:15 pm on Feb 20, 2023 (gmt 0)

Sorry for having expressed my opinion, I didn't mean to upset you. I won't do it again.

tangor

2:05 am on Feb 21, 2023 (gmt 0)

Don't worry about it @Dimitri! This is all opinion as I don't see any self-professed lawyers present (or judges) or the dispensing of legal advice.

While we all accumulate knowledge from various sources over our lifetimes, we can, and often do, create a unique (and copyrightable) outlook that can be printed, filmed, recorded (audio) or put on the web. THAT is the copyright we all depend upon. Most nations have copyright laws of various kinds and most have real teeth.

ChatGPT is a mashup algo that APPEARS creative, but really is nothing more than an advanced spinner with a clever interface.

buckworks

3:06 am on Feb 21, 2023 (gmt 0)

So by not blocking the crawlers used by ChatGPT's systems

Could someone pretty-please explain how to block those crawlers and their ilk?

phranque

4:34 am on Feb 21, 2023 (gmt 0)

can't be done.
ChatGPT doesn't have crawlers.

graeme_p

11:05 am on Feb 21, 2023 (gmt 0)

ChatGPT doesn't have crawlers.

So how did they get their content? If they did not crawl the sites they must have got access to the data somehow. That actually makes it even more likely their use was legal as they would have paid for access.

I notice that the public part of the article says they are "criticising", not "suing", so it sounds like they do not have a legal case. Big media are hardly shy of using lawyers.

Sorry for having expressed my opinion, I didn't mean to upset you. I won't do it again.

You did not upset me. I just explained why you are wrong.

[xkcd.com...]

:)

This is all opinion as I don't see any self-professed lawyers present (or judges) or the dispensing of legal advice.

My point exactly. Ultimately it's all speculation until a judge rules on it.

I do have a slight advantage over most people here, as, although it was a long time ago and only covered a narrow area, and was UK rather than US, I do have some professional level training in company and business law.

ChatGPT is a mashup algo that APPEARS creative, but really is nothing more than an advanced spinner with a clever interface

True in that it just juggles words. However, its output is often indistinguishable from a human being's. I would compare it to low-level article writing, where people write based on looking at what others have written (SEO-driven articles, Wikipedia articles, etc.), rather than to spinners.

What I think ChatGPT and the like will do is destroy demand for not just spinners, but content farms and "SEO article writing". Unfortunately, by providing a better alternative to content farms it is also going to make some good content a lot less visible.

engine

11:18 am on Feb 21, 2023 (gmt 0)

Here's a consideration: A student goes to a library to read books to learn: The books are covered by copyright.
The student writes their own interpretation of information from the books and creates their dissertation.

As long as the student does not create a copy of the content of the book, and interprets it in their own way, that's what the student's tutor would expect, and the examiners would judge.

Tutor = original book/website.
Examiners = consumers/editors/publishers
Student = ChatGPT

motorhaven

10:49 pm on Feb 22, 2023 (gmt 0)

Here's another consideration:

A program on a publicly viewable archive has a license requiring attribution for any derivative works. Then ChatGPT derives code from it, without attribution.

And regarding another point in this thread, apparently disallowing the following in your robots.txt will block it directly, as well as two of its data sources:

OpenAI
Common Crawl
WebText2
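For anyone who wants to test such rules before relying on them, here is a minimal sketch using Python's standard urllib.robotparser. Treat it as an illustration under assumptions rather than a confirmed recipe: CCBot is the user-agent token Common Crawl documents for its crawler, while the sample robots.txt, the example.com URL and the is_blocked helper are made up for this example. No user-agent tokens are assumed for the other two names in the list, so they are left out.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out Common Crawl's crawler (CCBot),
# leave every other crawler alone.
ROBOTS_TXT = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/page") -> bool:
    """Return True if robots_txt forbids user_agent from fetching url.

    The default URL is a placeholder; pass a real page from your own site.
    """
    parser = RobotFileParser()
    # parse() takes the file content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("CCBot", "Googlebot"):
        state = "blocked" if is_blocked(ROBOTS_TXT, agent) else "allowed"
        print(f"{agent}: {state}")
```

Keep in mind that robots.txt only works against crawlers that choose to honour it, and it does nothing about copies that were collected before the rule was added.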

graeme_p

3:01 pm on Feb 25, 2023 (gmt 0)

@engine I think what a lot of people do not understand is that ChatGPT is more like the student (assuming a conscientious student who understands the material and then writes based on their own understanding).

With a spinner you can link particular output to particular source data. So you might be able to (especially if you have access to the code and the input data) link a particular sentence in the output to a particular sentence in the sources.

However, with most ML (especially neural networks) you cannot do this. The inputs train the neural network as a whole. There is no traceable link between any output and any input. This means that either the output is a derivative work of ALL the training data, OR of none of it.

That is why they are claiming that there is a problem with the use of their material as training data, rather than claiming the output is a breach of copyright.

The problem is that this is a far harder claim to pursue. They did not do anything a search engine crawler would not do in terms of making copies of text and processing it internally. This is why they are criticising rather than suing.

@motorhaven, thank you for that. It may not be enough. They can probably get material from Bing given MS's involvement, and Google is working on similar software and already has most of the public web indexed. There are also huge bodies of text that are either very liberally licensed (e.g. Wikipedia), out of copyright, or not covered by copyright for other reasons.

I have a new site where I think I could rely on other sources for traffic, and I am seriously considering giving it a robots.txt that bans all crawlers.