Forum Moderators: phranque

How do I handle these ~90K internal and external links?

Need help handling ~90K links in an old site

         

TheBigK

12:49 pm on Jan 3, 2023 (gmt 0)

10+ Year Member Top Contributors Of The Month



I have a ~15 year old forum site that has a ton of user-generated content. Ahrefs informed me that there are plenty of broken internal and external links. In order to fix this site, I extracted all the links contributed by members throughout the year in a Google Sheet - and there are ~90K of them. I need help handling these.

I wish to preserve only the high-quality internal and external links. I'm trying to figure out how to achieve that. Here's my thinking process so far -

1. Look for .gov, .edu links and make them 'follow' (currently nofollow).
2. Look for links to wikipedia.org and make them follow.
3. Look for internal, valid links and make them follow.

Delete all the other links from the text.

Can someone comment whether this is a good strategy? If not, I welcome your suggestions.
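
The three rules above can be sketched as a small classifier that sorts each URL from the spreadsheet into a bucket. This is only an illustration: `www.example.com` stands in for the forum's own host name, and the trusted lists are assumptions to be adjusted, not a recommendation.

```python
from urllib.parse import urlsplit

# Hypothetical values: replace with the forum's real host names.
INTERNAL_HOSTS = {"www.example.com", "example.com"}
# Assumed "high-quality" rules taken from the list in the post.
TRUSTED_SUFFIXES = (".gov", ".edu")
TRUSTED_HOSTS = {"wikipedia.org"}


def classify(url: str) -> str:
    """Return 'internal', 'trusted', or 'other' for a single URL."""
    host = (urlsplit(url).hostname or "").lower()
    if host in INTERNAL_HOSTS:
        return "internal"
    if host.endswith(TRUSTED_SUFFIXES):
        return "trusted"
    # Match the trusted host itself or any subdomain (en.wikipedia.org, ...).
    if any(host == h or host.endswith("." + h) for h in TRUSTED_HOSTS):
        return "trusted"
    return "other"
```

Links in the "other" bucket are only candidates for removal. Whether a kept link is actually still alive is a separate check.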

tangor

2:40 pm on Jan 3, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Delete (or fix) the broken links first, then see what you have left. Might not be as many as thought.

engine

3:09 pm on Jan 3, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Wow, that's a lot of broken links.

If you have them in a spreadsheet you could sort them, depending upon the parameters you've set.

Internal links are where I'd start, as it's important the site itself looks well-managed.

Sgt_Kickaxe

3:28 pm on Jan 3, 2023 (gmt 0)



Check for redirects too while you're at it, not just broken links. A redirect (301) to adult content, for example, is probably worse than a simple broken link (404).

There is an entire industry out there looking for backlinks to re-purpose and/or redirect.
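
One rough way to flag the redirects worth a closer look is to compare the host of the original link with the host in the redirect's Location header. The sketch below assumes the status code and Location value have already been collected, and it treats a bare "www." prefix as the same host, which is a simplification.

```python
from urllib.parse import urljoin, urlsplit


def _bare_host(url: str) -> str:
    """Lower-cased host name with a leading 'www.' removed."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def redirect_changes_host(original_url: str, location: str) -> bool:
    """True if the redirect target lives on a different host than the link."""
    # The Location header may be relative, so resolve it against the link.
    target = urljoin(original_url, location)
    return _bare_host(original_url) != _bare_host(target)
```

A host change does not prove the target is bad, and a same-host redirect does not prove it is good. It only narrows down which redirects need a human to look at them.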

TheBigK

5:06 pm on Jan 3, 2023 (gmt 0)

10+ Year Member Top Contributors Of The Month



Thank you for the responses. The main problem I'm facing is the volume of the links I'm dealing with. I've all the links in a spreadsheet - which looks like this: [snip]
There are 90K of these records. The ones that point to www.[snip].com are the internal links.

As I mentioned, my new goal is to only keep -
1. Valid internal links
2. High-value external links (.gov, .edu and some popular .com domains).

I'm wondering what my best options are for checking for broken links and redirects. Is there any service or software you'd recommend for this? Please let me know.

[edited by: phranque at 11:58 pm (utc) on Jan 3, 2023]
[edit reason] specifics [/edit]

mack

5:25 pm on Jan 3, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



For something like this, I would be tempted to use server-side scripting and a database to access each link and return a response code. 200 would mean the link is still valid, 301/302 would mean it's redirecting and 404 would mean it's currently broken. I would then set about removing the 404 links (if they are still broken) and then check redirects to ensure they are not leading to content you don't want to be linking to (domains being bought up after expiry for example).

I would probably do this using PHP/MySQL. Import your spreadsheet into the DB, then use curl to determine the server response code: loop through each row and populate each entry with the code returned. This is just a general idea, and the specifics of how to achieve it go well beyond the scope of this forum. The PHP forum would be a good place to ask if you try this and run into any problems or questions.

Mack.
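
The same idea can be sketched in Python rather than PHP. The example sends one HEAD request per URL without following redirects and sorts the status code into the buckets described above. Treating 410 as broken, the "review" bucket, and the User-Agent string are assumptions added for illustration.

```python
import http.client
from urllib.parse import urlsplit


def bucket(status: int) -> str:
    """Map an HTTP status code to a rough action bucket."""
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    if status in (404, 410):  # 410 added as an assumption
        return "broken"
    return "review"  # everything else needs a manual look


def check(url: str, timeout: float = 10.0):
    """Return (status, location) for one URL, or (None, None) on failure."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None, None
    conn_cls = (http.client.HTTPSConnection if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = None
    try:
        conn = conn_cls(parts.hostname, parts.port, timeout=timeout)
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # http.client does not follow redirects, so 301/302 are seen as-is.
        conn.request("HEAD", path, headers={"User-Agent": "link-audit-sketch"})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    except (OSError, ValueError, http.client.HTTPException):
        return None, None
    finally:
        if conn is not None:
            conn.close()
```

Some servers answer HEAD differently from GET, so anything that lands in "review" is worth retrying with a normal GET before it is treated as dead. With 90K rows, it also makes sense to check each distinct URL only once and to pause between requests to the same host.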

lucy24

6:56 pm on Jan 3, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



How many, if any, of the links are ever actually used, or have ever been used, by human readers? (This is one of the very few things I ever use analytics--as opposed to raw logs--for. When people leave a site, is it to go somewhere I sent them, or do they just leave?)

phranque

12:15 am on Jan 4, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



>> Delete all the other links from the text.

what are you doing if the remaining anchor text is, for example, "Click here."?

>> I extracted all the links contributed by members throughout the year in a Google Sheet - and there are ~90K of them

>> Wow, that's a lot of broken links.

as i understand it, that's 90K links in 2022 UGC, internal/external, 200/404/3xx/4xx/etc

>> I'm wondering what my best options are for checking for broken links and redirects.

i would suggest that you feed that list of urls into a simple crawler such as Xenu's Link Sleuth.

TheBigK

5:06 am on Jan 4, 2023 (gmt 0)

10+ Year Member Top Contributors Of The Month



@mack - Yes, I've written a PHP script to run through all the links and find out which ones return a 200 status. I'll post about my observations.
@lucy24 - 100% of these links are contained in archived discussions. Only 5% of these discussions receive traffic. About 15 months ago, I had decided to drop all the unused content and let the pages 404. But it looks like that was not a good move. Ahrefs dropped my site's health score to 3 :-/ . I've therefore decided to restore all the old pages from a database backup and at least have a healthy internal linking structure.

@phranque - I've extracted all the <a> tags. Which means, when I delete them, the entire <a> tag will be removed; so there won't be any anchor text. Let me know if you have comments about this. My worry is that Google may not like a massive change like this all of a sudden.

Yes, 90K links in 2022 UGC. I'm going to run a script to find out the HTTP status code each one returns. That should help me get rid of the majority of these links. It looks like Xenu's Link Sleuth is available only for Windows, and I only have Mac machines with me. My own crawler should do the job.

lucy24

5:53 am on Jan 4, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> the entire <a> tag will be removed; so there won't be any anchor text
Reasonable, if the anchor text was in the nature of “click here”. But it's a bit hard on the reader if the linking text was an integral part [webmasterworld.com] of the surrounding prose.

When I asked about the links being used, I didn't mean do they appear on publicly accessible pages; I meant, does anyone really follow them?
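
To illustrate the difference between deleting a link and merely unlinking it, here is a minimal sketch that removes the <a> wrapper for dead URLs but leaves the anchor text in the sentence. It assumes post bodies are stored as HTML with quoted href values. A regular expression is a shortcut here, and a real HTML parser would be more robust on messy markup.

```python
import re

# Matches <a ... href="URL" ...>inner</a>; assumes quoted href attributes.
ANCHOR = re.compile(
    r"""<a\b[^>]*\bhref\s*=\s*["']([^"']*)["'][^>]*>(.*?)</a>""",
    re.IGNORECASE | re.DOTALL,
)


def unwrap_dead_links(html: str, dead_urls: set) -> str:
    """Replace anchors whose href is in dead_urls with their inner text."""
    def repl(match):
        href, inner = match.group(1), match.group(2)
        # Keep the visible text, drop only the link around it.
        return inner if href in dead_urls else match.group(0)
    return ANCHOR.sub(repl, html)
```

With this approach the surrounding prose still reads normally, which matters when the linked words were part of the sentence rather than a bare "click here".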

TheBigK

6:12 am on Jan 4, 2023 (gmt 0)

10+ Year Member Top Contributors Of The Month



@lucy24 - Well, I'm not sure if people actually follow them. For the links, I'm also going to fix the anchor text to the best of my ability. It's going to involve a ton of manual work; but I think it'll be worth it. I can't keep my site's health score at 3. It needs to be 90+.

phranque

6:29 am on Jan 4, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



>> when I delete them, the entire <a> tag will be removed; so there won't be any anchor text. Let me know if you have comments about this.

my "click here" was a bad example in that case.
aren't the contents of the anchor element typically the most relevant text?

>> I only have Mac machines with me.

in a *nix environment, i would simply use lwp-request from the command line for this purpose.

TheBigK

7:05 am on Jan 4, 2023 (gmt 0)

10+ Year Member Top Contributors Of The Month



@phranque - Yes, anchor text is most important. But if I'm deleting the entire <a> tag, there won't be any traces of the bad links in the text. As I mentioned, I'm going to manually fix the anchor text for the links that are relevant and useful.

mack

6:03 pm on Jan 4, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



This isn't a perfect solution, but if your website has a site search you could point the links you remove to the search script with the anchor text as the search term. It will work for those instances where removing the entire link and anchor will affect the readability of the page.

Mack.
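
A minimal sketch of that idea: build a replacement anchor that points at the site search, with the old anchor text as the query. The `/search` path and the `q` parameter are hypothetical, since the forum's real search URL is not given in this thread.

```python
import html
import re
from urllib.parse import urlencode

SEARCH_PATH = "/search"  # hypothetical search script path
SEARCH_PARAM = "q"       # hypothetical query parameter name


def search_link(anchor_html: str) -> str:
    """Wrap the old anchor content in a link to the site search."""
    # Reduce the anchor content to plain text for use as the search term.
    text = re.sub(r"<[^>]+>", "", anchor_html)
    text = re.sub(r"\s+", " ", html.unescape(text)).strip()
    href = SEARCH_PATH + "?" + urlencode({SEARCH_PARAM: text})
    return f'<a href="{html.escape(href)}">{anchor_html}</a>'
```

For example, `search_link("wind turbine <b>design</b>")` produces a link to `/search?q=wind+turbine+design` around the original markup. Anchors whose text is just "click here" or a bare URL would make poor search terms and are probably better unlinked instead.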

TheBigK

4:15 am on Jan 5, 2023 (gmt 0)

10+ Year Member Top Contributors Of The Month



@mack - That is a good suggestion. The pages to which these backlinks point (both internal and external) are only to improve site health. Only ~2-5% of these web pages actually receive traffic from Google. That's the reason I had decided to drop all those pages, but I had no clue how to fix the internal links to these pages. Now that I've figured out a way to extract the links and fix the linking structure, my best bet is to eliminate them altogether where the link points to a nonexistent page. It'll enhance the user experience.

It's an experiment in SEO. I've already lost ~95% of the site's traffic and it can't get worse. I hope Google will appreciate these changes.

PS: Let me know if you think this could be a serious mistake.

buckworks

7:16 pm on Jan 5, 2023 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



>> I only have Mac machines with me.

For Mac, I like Integrity and Scrutiny link checkers from Peacock Media.

tangor

8:16 pm on Jan 5, 2023 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



>> Now that I've figured out a way to extract the links and fix the linking structure; my best bet is to eliminate them altogether where the link points to a non-existing page. It'll enhance the user experience.

That's not an experiment in SEO, that's practical common sense. If a link is of no value, make it disappear. If a page on your site has lost its value, archive it or make it disappear.

Most folks call this house cleaning. All this web linking stuff sounds neat, but if 60 minutes after setting it that link gets broken (and makes YOU look bad), why have it in the first place? Pick and choose which links are of value and that will increase the VALUE your site offers TO THE USER.

Your thinking on .gov and .edu links is okay. Filter for those and I suspect you won't have all that much to work on. AND, if any of those are broken, they aren't worth keeping in the final.