Forum Moderators: phranque

I give up - need help

getting 2 sites crawled and not working

         

curved

7:08 pm on Aug 10, 2011 (gmt 0)

10+ Year Member



I've been building sites a long time and I've never had a problem with a site not getting crawled. There may be a simple problem and I can't see the forest for the trees, but I have 2 websites that are inaccessible to bots for some reason.

[ceramic-tile-montreal.ca...] and [seoservicesmontreal.ca...]

You can go to the sites just fine, navigate them just fine, but google webmaster tools can't fetch the site, xml-sitemaps.com can't crawl the site, websitegrader.com can't access them, etc.

I uploaded the google verification file. I uploaded the robots.txt file. I have 4 types of sitemaps that are all working fine.

Can anyone see the problem? I'll be extremely grateful for any help. I provide content. You help me, I'd be happy to help you. :)

curved

7:19 pm on Aug 10, 2011 (gmt 0)

10+ Year Member



This is what the google fetch says

Fetch as Googlebot
This is how Googlebot fetched the page.

URL: [ceramic-tile-montreal.ca...]

Date: Wednesday, August 10, 2011 12:16:17 PM PDT

Googlebot Type: Web

Download Time (in milliseconds): 312

HTTP/1.1 301 Moved Permanently
Date: Wed, 10 Aug 2011 19:16:17 GMT
Server: Apache mod_fcgid/2.3.6 mod_auth_passthrough/2.1 mod_bwlimited/1.4 FrontPage/5.0.2.2635
X-Powered-By: PHP/5.2.17
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
X-Pingback: [ceramic-tile-montreal.ca...]
Set-Cookie: PHPSESSID=f5f8f71ec9c15802287fb64e82d2a919; path=/
Location: [ceramic-tile-montreal.ca...]
Content-Length: 0
Keep-Alive: timeout=8, max=50
Connection: Keep-Alive
Content-Type: text/html; charset=UTF-8
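
A pasted response like the one above can be summarized with a few lines of Python. This is only a rough sketch: the function name `summarize_response` and the sample values (including the `example.com` Location, since the real one is truncated in the post) are made up for illustration. One thing it makes visible: a 301 that carries `X-Powered-By: PHP` and a `PHPSESSID` cookie has passed through PHP, which suggests the redirect is issued by the application rather than by a plain server-level rule.

```python
def summarize_response(raw: str) -> dict:
    """Pull the status code and a few headers out of a pasted HTTP response."""
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    # The status line looks like "HTTP/1.1 301 Moved Permanently".
    _version, code, *_reason = lines[0].split(" ")
    headers = {}
    for ln in lines[1:]:
        name, sep, value = ln.partition(":")
        if sep:  # skip anything that is not a "Name: value" line
            headers[name.strip().lower()] = value.strip()
    status = int(code)
    return {
        "status": status,
        "is_redirect": status in (301, 302, 303, 307, 308),
        "location": headers.get("location"),
        "sets_cookie": "set-cookie" in headers,
        "powered_by": headers.get("x-powered-by"),
    }


# Hypothetical sample modeled on the post; the Location value is a placeholder.
SAMPLE = """\
HTTP/1.1 301 Moved Permanently
Server: Apache
X-Powered-By: PHP/5.2.17
Set-Cookie: PHPSESSID=example; path=/
Location: http://example.com/
Content-Length: 0
"""

if __name__ == "__main__":
    print(summarize_response(SAMPLE))
```

If `powered_by` is set on a redirect, the CMS is the first place to look; if it is missing, the server configuration or the host is the more likely source.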

Trucker

2:42 pm on Aug 11, 2011 (gmt 0)

10+ Year Member



I'd say the line below is the problem. You don't want to be passing a 301 if the visitor is already on the proper page.

HTTP/1.1 301 Moved Permanently

curved

4:46 pm on Aug 11, 2011 (gmt 0)

10+ Year Member



Yes, but if you view source, that isn't there. It's not in the htaccess file. Robots.txt is set to allow. There is no 301 redirect written into the source code.

Demaestro

5:06 pm on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Clicking the link, this is what I get.

There is a 301 redirect for sure that takes me from:
www.ceramic-tile-montreal.ca
To
ceramic-tile-montreal.ca

Result - Protocol - Host
301 - HTTP - www.ceramic-tile-montreal.ca
200 - HTTP - ceramic-tile-montreal.ca
304 - HTTP - ceramic-tile-montreal.ca
304 - HTTP - ceramic-tile-montreal.ca
304 - HTTP - ceramic-tile-montreal.ca
304 - HTTP - ceramic-tile-montreal.ca

Something is redirecting me off of the www. subdomain. Maybe htaccess, maybe something in your zonefile? It could even be something in your framework if you are using one.

[edited by: Demaestro at 5:09 pm (utc) on Aug 11, 2011]

rocknbil

5:07 pm on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The 301 doesn't come from the source code, often not even from .htaccess - it's server level or sometimes software. Are you using a CMS?

curved

5:46 pm on Aug 11, 2011 (gmt 0)

10+ Year Member



First, I really appreciate these responses and your time. Yes, I'm using WordPress.

The htaccess for ceramic-tile-montreal.ca reads like this:

RewriteEngine off
<IfModule mod_suphp.c>
suPHP_ConfigPath /home/ceramict/public_html/php.ini
<Files php.ini>
order allow,deny
deny from all
</Files>
</IfModule>

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>

# END WordPress

Not sure where the first part comes from. The second part is the normal WordPress mod_rewrite block.

But I'm having the same issue on seoservicesmontreal.com, and the htaccess there only has this:


# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>

# END WordPress

curved

5:47 pm on Aug 11, 2011 (gmt 0)

10+ Year Member



I normally do redirect the www version to the non-www version or vice-versa as google suggests, but have not done so on these two sites. I normally use htaccess for that.

Leosghost

8:16 pm on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



<slightly OT>
umm ..just a precautionary word ..there are some horrific "auto translated" French phrases in the ceramic site ( which it would be advisable to correct before it does get crawled and indexed ) eg "salle de montre" ( which means absolutely nothing ..) should be "salle d'exposition"..which is the correct translation of "showroom" ( and occurs 3 times on your "page d'accueil" )..many of your phrases make no sense in French..even Quebecois French ;-)

"Carreaux" is used in some places and in others you use the singular "carreau" whereas it should be the plural "carreaux" in all of them ..there are many more things that you need to "sort out" on that site..in addition to the language.. before allowing it to be crawled.
</slightly OT>

HTH :)..better you get the chance to correct them before the clients or their friends or customers point them out to you or to others.

lucy24

9:58 pm on Aug 11, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



If you're on shared hosting and you've asked them to redirect to with-www or without-www, it will happen before a visitor ever arrives at your htaccess. This makes the logs look a little weird, as bad robots first get politely redirected and then get a 403 slammed in their faces. But what about those logs? Are they utterly devoid of visits from the googlebot? Most people are more likely to ask how to get g### not to crawl their site ;)

curved

10:23 pm on Aug 11, 2011 (gmt 0)

10+ Year Member



Leosghost, no autotranslation. Someone translated that for them; we just inserted the content. But I will pass that along to them, and thank you for pointing that out.

curved

10:25 pm on Aug 11, 2011 (gmt 0)

10+ Year Member



Lucy, thanks for your response, but no request like that was made to the hosting company, and each site even has its own C-class IP address, so there is no real relationship between them.

curved

10:29 pm on Aug 11, 2011 (gmt 0)

10+ Year Member



Once again, I appreciate everyone's help on this issue. If you need help with anything content-related, I'd be happy to help you in return.

curved

10:30 pm on Aug 11, 2011 (gmt 0)

10+ Year Member



Besides the french stuff. :)

Leosghost

12:07 am on Aug 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Now I've looked at the other SEO site..there are similar linguistic and grammatical issues in the French there too..

someone translated that for them


"Someone" may well have used "autotrans" to do that "translation".and not "let on" .. it bears all the hallmarks of autotrans .."ponderous phrasing and prose, with errors"..

re: your indexing problem ..I took a look at what G finds with either domain name in "quotes"..only 4 results each time ..in both cases this thread scores at #1 ..and two of the other places are obscure MFA/Scraper sites where you appear to have a comment in each "blog" section ..both "no followed" ( WebmasterWorld also "no follows" links.. automatically ) ..one other on an obscure scraper ( again no followed ) whose purpose is to run adbrite and sell paid links to certain sites out of the main scraped area ....

And finally ..one mention on an Indonesian hacker site ..it tells "skiddies" how to DDOS etc ..

All the links ( 4 links per each of your sites ) thus .. from the 8 places in all ( apart from those from this thread ) resolve to 404s on each of your two sites..

Normally Google ( and other SEs) find even sites with no "inbounds" by themselves ..but you might want to actually try "priming the pump" and linking to these two sites from somewhere else that you own and doing it via an actual "followed link".;-)..then see what happens after 72 hours :)

Oh and one result I did get for a word string from your 2nd site ..gave a page on G with a DMCA removal notice attached to it ..but as yet no details at Chilling effects..so no way to know if it was concerning yours.

re: the "skiddies"..your "tile place" says "wordpress" on the bottom proudly in plain view ..and keeps mentioning "formulaire de contact"/ contact forms, sometimes more than once on the same page..with links to said forms ..given the ease with which a "skiddie" can take over wordpress ( if not secured properly ) via "forms"..and the fact that a skiddie site mentions you in G cache ..I'd take away the wordpress from the footer IIWY ..no need to paint a target :)..its like writing "kick me" on the seat of your pants..

Your code ( being auto generated ) is, sorry to say too much to wade through now ..but I note it does expose to all and sundry ( skiddies and webchavs included ) via "source".. the pathway to your "admin area" ..again not a good thing..

Leaving clues as to how to "get in" on wordpress is a good way to wake up to some base64 that your site didn't have the night before..and a world of hurt..

IMHO ..both sites would be better as just vanilla html..with enough jscript to run the slideshow and a secured PHP form mailer..( see some of rocknbil's recent posts for how to sanitise forms..) ..and they'd be lighter, faster ..and rank better.. when eventually indexed.

Oh ..and please ..put its reflection under the pot plant in the header .."floating" there like that ..çà m'angoisse ..

HTH :)

PS..do your analytics actually tell you any robots have been by ?..if so when ..and what did they do whilst they were visiting ..?

PPS ..Bing and thus Yahoo have both indexed both sites ..so there is nothing wrong with your server set up etc ..Yandex and Baidu have not got round to you yet ..get a real "follow" link to each ..and IMHO ( with what I have to go on ) you'll see the other bots soon..

Unusual for G and especially Baidu to be this slow ..but stranger things have happened :)

lucy24

12:34 am on Aug 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Oh, wait.
I uploaded the google verification file.

This isn't done in a vacuum. Did you get an immediate confirmation that the site exists and that it belongs to you? That is, does its name now show up in GWT? I don't remember what google says if it can't verify the site, but someone hereabouts will know. Maybe you should just try starting from scratch and verifying again. Delete the file and they'll give you a new one. And then leave it there, because they check periodically.

tangor

2:16 am on Aug 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Kill WP. Put up a static HTML page and see if that gets crawled and found in search...

In other words, start over. Nothing worse than a badly flawed WP startup... and way too much to figure out what went wrong. (Secondary, delete current install/setup and start over)

Life is too short, and sometimes brute force is appropriate.

curved

11:56 am on Aug 12, 2011 (gmt 0)

10+ Year Member



Now we're getting somewhere. Thanks guys. I will try an HTML page, tangor, and see what happens. But one thing is being overlooked: try crawling either site with websitegrader.com or xml-sitemaps.com, etc., and those bots can't crawl it either, so there is a technical issue. Yahoo and Bing will index a site even if it doesn't have content. :)

curved

11:57 am on Aug 12, 2011 (gmt 0)

10+ Year Member



lucy, I'll try what you said as well

curved

11:58 am on Aug 12, 2011 (gmt 0)

10+ Year Member



Leosghost, I'll let the client know about the French translations. Will also take your suggestions about the paths and the "kick me" sign. :)

Leosghost

12:32 pm on Aug 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yahoo and Bing will index a site even if it doesn't have content. :)

So will Google..their serps have as many of the "your account is ready for you to upload" and "directory structure" pages as the others do ..

I always put a place holder page ( pure html with inline CSS ) with only the site name ( and an obfuscated contact email ) and a copy of the description up ( and point to it with an inbound followed link from a ten year old site ) as soon as any new domain name is purchased..

Gives one a head start on scoring at #1 for one's own name..and is good for an immediate "heads up" if anything is getting in the way of indexing.

curved

1:06 pm on Aug 12, 2011 (gmt 0)

10+ Year Member



I've seen it in google as well

rocknbil

5:37 pm on Aug 12, 2011 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yeah that 301 is (very likely) coming from within Wordpress, from within the coding itself. I have seen this before - and my apologies, I never really found out **why** it was doing it, and your case may be different, but it had to do with my form actions and a leading slash.. Maybe this will help you figure it out.

I have seen this on Wordpress **and** several other CMS's (which makes me think it's something I'm doing . . . . can't imagine what . . .)

In my scenario, it was submitting a FORM and I'd get a 301 that redirects . . . back to the form. Since it was using a CMS where everything was rewritten from root, I did this:

<form action="my-cms-url">

What fixed it?

<form action="/my-cms-url">

Don't know if that will help but it's something to look at. Note that it's the non-existent CMS URL that was giving me grief, not an actual file reference. For whatever reason, the CMS recognized the URL and 301'ed it back to the originating page. In my case, forms wouldn't submit because the redirect lost all the POST data.
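
For what it's worth, here is how the two `action` values resolve. This is a sketch using Python's standard `urllib.parse.urljoin`, which follows the same relative-URL rules a browser applies. The page URL is a made-up placeholder, and this only shows where each form would be submitted; whether that difference is what triggered the 301 in the case above is an assumption, not something the post confirms.

```python
from urllib.parse import urljoin

# Hypothetical page that hosts the form; not a URL from this thread.
PAGE = "http://example.com/contact/form/"


def resolve_action(page_url: str, action: str) -> str:
    """Return the absolute URL a browser would submit the form to."""
    return urljoin(page_url, action)


if __name__ == "__main__":
    for action in ("my-cms-url", "/my-cms-url"):
        print(f"{action!r:>14} -> {resolve_action(PAGE, action)}")
```

Without the leading slash the target is relative to the current page's directory, so the same form posts to a different URL depending on which page it sits on. With the leading slash it always posts to the same root-relative URL.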

Are you using the Firefox plugin Live HTTP Headers? It will show you the request and response headers for each hop, including any redirect.

An aside (and don't mess with this until you solve the other): this code is inefficient. For every request it makes two extra file-system checks, first testing whether the requested path is an existing file (!-f), then whether it's an existing directory (!-d). Yeah, it's standard WordPress, but the same pattern is also deployed with just about every other CMS out there.

# BEGIN WordPress
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule . /index.php [L]
</IfModule>

An Apache expert here shed light for me: instead, if the request does NOT have a dot in it and it's not under wp-admin, rewrite it to index.php.

# BEGIN WordPress
RewriteEngine On
RewriteBase /
RewriteRule ^index\.php$ - [L]
RewriteCond %{REQUEST_URI} !/wp-admin/
RewriteRule ^([^.]+)$ /index.php [L]

The ifModule is not needed, mod_rewrite will work, or it won't . . . I had an extremely slow WP site that actually showed serious improvement with just this one rule (had tons of images, huge file system.)
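
To see which requests the simplified rule would hand to index.php, here is a rough Python approximation of the pattern and the condition. It is a sketch, not Apache: it assumes the .htaccess sits in the document root (so the leading slash is stripped before matching), it ignores query strings, and the helper name `goes_to_index` is invented for the example.

```python
import re

# Pattern from the rule above. In a per-directory (.htaccess) context Apache
# matches it against the path with the leading slash removed.
PATTERN = re.compile(r"^([^.]+)$")


def goes_to_index(request_uri: str) -> bool:
    """Approximate whether the simplified rule rewrites this URI to /index.php."""
    # RewriteCond %{REQUEST_URI} !/wp-admin/
    if "/wp-admin/" in request_uri:
        return False
    path = request_uri.lstrip("/")
    # An empty path (the site root) does not match "+"; the server's
    # directory index handling normally takes care of that request.
    return PATTERN.match(path) is not None


if __name__ == "__main__":
    for uri in ("/about/us/", "/images/logo.png", "/wp-admin/options.php",
                "/", "/2011/08/v1.2-released"):
        print(f"{uri:28} {goes_to_index(uri)}")
```

Two trade-offs show up here: a permalink containing a dot is no longer sent to index.php, and a real directory without a dot in its name is, because the rule no longer checks the file system.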

curved

6:14 pm on Aug 12, 2011 (gmt 0)

10+ Year Member



Thanks rocknbil, useful info. It's definitely a strange issue. I can ping the domains but not the IPs. Hostgator is looking into it. When I check the header info myself, I get the 200 OK like it should be, but bots are getting the 301 boot. Still not sure why.
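
One way to test the "200 for me, 301 for bots" theory is to request the same URL with different User-Agent strings and not follow redirects. The sketch below uses only Python's standard `http.client`; the function names and the browser User-Agent string are illustrative choices. It can only reveal differences keyed on the User-Agent header. If the server or host treats requests differently by IP address, this will not show it.

```python
import http.client
from urllib.parse import urlsplit

USER_AGENTS = {
    # Illustrative browser string; any current browser UA would do.
    "browser": "Mozilla/5.0 (Windows NT 6.1; rv:5.0) Gecko/20100101 Firefox/5.0",
    "googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
}


def fetch_status(url: str, user_agent: str, timeout: float = 10.0):
    """Send one GET without following redirects; return (status, Location)."""
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    try:
        conn.request("GET", target, headers={"User-Agent": user_agent})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()


def compare_agents(url: str) -> dict:
    """Map each named User-Agent to the (status, Location) it receives."""
    return {name: fetch_status(url, ua) for name, ua in USER_AGENTS.items()}

# Example (placeholder URL): print(compare_agents("http://example.com/"))
```

Run it against both the www and non-www hostnames. If the two agents get different answers, or the Location header points back at the URL that was requested, that narrows down where the redirect is coming from.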