We have a site that currently lists a bunch of unique widgets: www.example.com/directory/001
*Note: we are not an e-commerce site but an encyclopedia-type site with millions of unique pages; content quality varies from excellent to thin.
Each widget page has a unique URL, which we reference in our sitemaps and in the page's rel=canonical tag. Example:
www.example.com/directory/001
www.example.com/directory/002
...and so on.
On various pages we append a tracking parameter (e.g. "?tid=trackingID") to internal links so that we can track user behavior. For example, we might link to www.example.com/directory/001 from various pages throughout the website as:
www.example.com/directory/001?tid=trackingA
www.example.com/directory/001?tid=trackingB
and so on.
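To make the relationship between the tracked links and the clean URL concrete, here is a minimal Python sketch that maps a tracked URL back to the URL we treat as canonical. It assumes "tid" is the only tracking parameter; the function name strip_tracking and the TRACKING_PARAMS set are made up for illustration and are not part of our actual code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: "tid" is the only tracking parameter used on the site.
TRACKING_PARAMS = {"tid"}


def strip_tracking(url: str) -> str:
    """Return the URL with tracking query parameters removed."""
    parts = urlsplit(url)
    # Keep every query parameter that is not a known tracking parameter.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in TRACKING_PARAMS
    ]
    return urlunsplit(
        (parts.scheme, parts.netloc, parts.path, urlencode(kept), parts.fragment)
    )


if __name__ == "__main__":
    # Prints https://www.example.com/directory/001
    print(strip_tracking("https://www.example.com/directory/001?tid=trackingA"))
```

Under that assumption, both tracked variants above reduce to the same clean URL, which is the one listed in the sitemap.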
We have verified that the rel=canonical is set up correctly on the www.example.com/directory/001 page, and that our sitemaps refer to the appropriate URL (e.g. www.example.com/directory/001).
Recently we noticed that Google has indexed three or more variations of the same URL, ignoring the rel=canonical and our sitemap. For example, using a site: operator query, we've identified that the following are all indexed in Google:
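For anyone who wants to run the same kind of check, the following is a rough sketch of how a canonical tag can be pulled out of a page's HTML with Python's standard library. The class and function names (CanonicalFinder, find_canonicals) are hypothetical, and the sample HTML is a stand-in rather than our real markup.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect the href of every <link rel="canonical"> tag in a document."""

    def __init__(self):
        super().__init__()
        self.canonicals = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated tokens; compare case-insensitively.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "canonical" in rel_tokens and href:
            self.canonicals.append(href)


def find_canonicals(html: str) -> list:
    """Return all canonical URLs declared in the given HTML string."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.canonicals


if __name__ == "__main__":
    # Stand-in markup for a tracked variant of a widget page.
    sample = (
        "<html><head>"
        '<link rel="canonical" href="https://www.example.com/directory/001">'
        "</head><body></body></html>"
    )
    print(find_canonicals(sample))
```

The expectation is that the clean page and each ?tid= variant return exactly one value, and that it is the clean URL. A page returning no value, or several conflicting ones, would be worth investigating.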
www.example.com/directory/001
www.example.com/directory/001?tid=trackingA
www.example.com/directory/001?tid=trackingB
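As an illustration of how a list of indexed URLs (for example, collected by hand from site: queries) could be checked for this pattern, here is a small Python sketch that groups URLs by host and path, ignoring the query string, and reports pages with more than one indexed variant. The function names and the input list are assumptions for the example, not an export format from any Google tool.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_page(urls):
    """Group URLs by host + path, ignoring query string and fragment."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        groups[parts.netloc + parts.path].append(url)
    return dict(groups)


def indexed_duplicates(groups):
    """Keep only the pages that have more than one indexed URL variant."""
    return {page: variants for page, variants in groups.items() if len(variants) > 1}


if __name__ == "__main__":
    # Hypothetical list of URLs found in the index.
    indexed = [
        "https://www.example.com/directory/001",
        "https://www.example.com/directory/001?tid=trackingA",
        "https://www.example.com/directory/001?tid=trackingB",
        "https://www.example.com/directory/002",
    ]
    for page, variants in indexed_duplicates(group_by_page(indexed)).items():
        print(page, len(variants))
```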
Now, I've read that this isn't a big problem for smaller sites. But for large-scale sites (millions of pages), I have reason to believe it is creating problems:
-We've seen a significant, steady drop in the total number of indexed pages reported in Search Console (about 30% so far, and it's continuing). This might be because our sitemaps link only to the www.example.com/directory/001 URLs, so Search Console would show a large de-indexing of URLs, since the www.example.com/directory/001?tid=trackingA URLs don't exist in our sitemaps. However, we've also seen a steady decrease in traffic that correlates directly with the decrease in indexed pages reported in Google Search Console (WMT).
-My assumption is that Google has a limited number of pages it will index for a large website with millions of pages, and that having multiple variants of the same page indexed just wastes the opportunity for other unique content to be indexed.
I've done a sizable amount of research and have found a number of articles (a quick Google search on 'duplicate content tracking urls' with other keywords turns up plenty) with somewhat conflicting information. Many just say that we are cannibalizing our SERPs by having multiple URL variants indexed, but as far as we can tell, our SERPs remain consistent for the pages that do get indexed.
My questions:
- Has anyone run into a similar problem where Google ignores the rel=canonical, and if so, how did you deal with it?
- Can you provide any insight into whether my assumption holds weight? Mainly, that these variants of the same URL are just taking up space in our 'index appropriation per website', which may or may not exist in Google's index?
- Would Google finding these new URLs with the ?tid tracking parameter cause it to de-index the existing 'clean/unique/canonical' URL in favor of the new ones? If so, Google does not appear to be indexing as many new URLs as it is removing.
Really appreciate any insight or further discussion and clarification on this issue.