Forum Moderators: open

Message Too Old, No Replies

MySQL, should I use MEDIUMTEXT or LONGTEXT?

         

csdude55

5:15 am on Mar 20, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I'm working on the table for my message board posts. I don't really limit the size of the post, in part because I use Rich HTML. A lot of times someone will copy and paste an article from another website, and the coding that they copy can get pretty long (and completely irrelevant). I try to auto-strip most of it, but it doesn't catch everything.

To date, the longest post was 4,515,609 characters, but that was an anomaly that shouldn't happen again. It's possible, of course, but it shouldn't.

Other than that, the largest post was around 415,000 characters.

So I can realistically use MEDIUMTEXT for the column type, no problem. But the question is, is there any actual reason to use it instead of LONGTEXT? Medium requires 3 bytes of overhead while Long requires 4... but that 1 byte seems pretty insignificant in the grand scheme of things.

Would mediumtext return a search query faster or anything?

lammert

9:56 am on Mar 20, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A lot of times someone will copy and paste an article from another website, and the coding that they copy can get pretty long (and completely irrelevant). I try to auto-strip most of it, but it doesn't catch everything.

I'll try to rephrase your question in my own language:

  1. My sites monetize on scraped contents of other websites
  2. The content is copied verbatim and I am worried that 4 MB is not enough to hold the largest pages
  3. Do you colleague webmasters have pages larger than 4MB which require me to increase my database field size in order to have them scraped properly?

Don't expect an answer from me on this one.
Don't expect an answer from me on any future question you ask here.
You have pretty much disqualified yourself and exposed your real intent.

Dimitri

11:17 am on Mar 20, 2020 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member Top Contributors Of The Month



A lot of times someone will copy and paste an article from another website

hmm...

By the way, beside the scarping aspect, it's not good idea to store and output complex html posted by third parts. The code can contain malicious code of all kinds, dangerous for the visitors, the members, and your server.

csdude55

5:39 pm on Mar 20, 2020 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@lammert, that's a terrible misunderstanding. It's obvious that you've been a victim of some such thing so you're projecting, but that's not the case here at all.

In my message board, a common discussion lately is, obviously, politics and the coronavirus. So in the discussions, people regularly copy and paste from CNN, Fox, Facebook, even Gmail. The large 4M post was someone copying an image from Facebook.

This isn't API scraping, it's simply using articles from news sites as a source and discussion topic. And we require users to provide a link to the source, much like how WebmasterWorld does:

[webmasterworld.com...]

By the way, beside the scarping aspect, it's not good idea to store and output complex html posted by third parts. The code can contain malicious code of all kinds, dangerous for the visitors, the members, and your server.

I try to automagically remove it the best that I can, but it's generally me just playing catch-up with whatever code the user's browser decides to throw in. Here's an example of something that made it through last year:

<div id="m_1268833377758258791m_-502332993039140378gmail-m_5463551879754637323m_-2788003690038536215m_-6275170782920768856gmail-m_-7273562805246369225ydp1d285bb6yiv7531382894aolmail_m_-6558918472869887338m_-7574514109738156460yiv7807237563yui_3_16_0_ym19_1_1501813687895_18094" style="color: rgb(34, 34, 34); font-family: Arial, Helvetica, sans-serif; font-size: large;"><div class="m_1268833377758258791m_-502332993039140378gmail-m_5463551879754637323m_-2788003690038536215m_-6275170782920768856gmail-m_-7273562805246369225ydp1d285bb6yiv7531382894aolmail_m_-6558918472869887338m_-7574514109738156460yiv7807237563MsoNormal" style=" background-repeat: repeat; background-size: auto; margin-left: 0.5in;"><span class="m_1268833377758258791m_-502332993039140378gmail-m_5463551879754637323m_-2788003690038536215m_-6275170782920768856gmail-m_-7273562805246369225ydp1d285bb6yiv7531382894aolmail_m_-6558918472869887338m_-7574514109738156460yiv7807237563"><b><span style="font-family: Arial, sans-serif; color: rgb(29, 33, 41); font-size: 14pt;">


Since I want the user to be able to copy and paste from my own site, now the programs reads my stylesheets to create an array of selectors that I actually use, and it removes anything else. I also have a list of "approved" inline styles, and it removes anything outside of the approved list. So while those classes and IDs should be removed now, the color: rgb(34, 34, 34); font-family: Arial, Helvetica, sans-serif; font-size: large; would still be there.