I noticed that I have hundreds of pages indexed that don't exist on my website. For instance,
domain.com/blog/wp-content/keyword-keyword.html
They are almost always spammy keywords for drugs, games, etc. When I click on these pages, all I see are blank pages with empty source code. When I check the server, these pages simply don't exist.
In addition, I have many other pages indexed with URLs like the ones below. When I click on them, I end up on an archive, category, or post page, which makes them duplicates, because ideally someone should be able to reach those pages directly:
domain.com/blog/?m=fcwzvwvr&paged=112
domain.com/blog/?cat=iukuqvfv&paged=3
domain.com/blog/?p=pwivtbwb&paged=180
My WordPress installation is totally up to date. I asked a question a while ago in another place, and the answer was: "google will index any page that returns a 200 status code. if someone links to a non-existent page on your site and it returns a 200 status code, google will likely index it."
This is like rocket science to me, but I still went to check my 404 page and noticed that I had customized it many years ago with links to my search page and home page. Assuming that it may be the problem, I cleaned all that up and have now inserted the Google 404 page widget JavaScript.
Will this do it? What else do I need to do? How do I make sure that whoever is linking to these pages will not be able to achieve this result? Is there a way to test that the 404 page is in compliance? By the way, why would anyone try to link to these pages? I have noticed a lot of Russian and Polish spam websites linking to my plastic surgery website. Why would they do it? I thought getting links was difficult but these people probably mean harm.
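For what it's worth, the kind of test I have in mind would look roughly like the sketch below: request a URL that should not exist and see which status code the server actually sends. This is only a sketch under assumptions. The URLs are placeholders modeled on the examples above, and `fetch_status`, `is_soft_404` and `report` are made-up helper names, not part of WordPress or any Google tool. Note that `urllib` follows redirects, so the status reported is the one for the final page.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code the server sends for url (after redirects)."""
    request = urllib.request.Request(url, headers={"User-Agent": "status-check/0.1"})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx responses; the code is still what we want.
        return error.code


def is_soft_404(status):
    """A page that should be missing but answers 2xx can still be indexed."""
    return 200 <= status < 300


def report(urls, fetch=fetch_status):
    """Return (url, status, verdict) rows for URLs that should NOT exist."""
    rows = []
    for url in urls:
        status = fetch(url)
        verdict = "returns 2xx, can be indexed" if is_soft_404(status) else "error status"
        rows.append((url, status, verdict))
    return rows
```

Calling `report(["http://domain.com/blog/wp-content/keyword-keyword.html"])` (with the real domain in place of the placeholder) would then show whether the spam URL answers with 404 or with 200. The `fetch` parameter exists only so the logic can be tried with canned status codes instead of real requests.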