Google Admits GoogleBot can Parse and Execute JavaScript Code on-the-fly - Google Search and SEO forum at WebmasterWorld - WebmasterWorld

Forum Moderators: Robert Charlton & goodroi

Google Admits GoogleBot can Parse and Execute JavaScript Code on-the-fly

Brett_Tabke

12:08 pm on Jun 26, 2010 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

Best Post Of The Month

Rumors of Google's ability to parse JavaScript go back to when Google first hired away Mozilla developers. It was reasoned that if Google could build a browser (done), it could easily retrain GoogleBot to parse JavaScript. The effect would be that Google could interpret things such as dynamically built URLs and content produced via Ajax calls. For the first time, Google has publicly admitted that it has the capability.

Yet those looking closely at their server logs may notice that Google is now requesting links that don't appear directly in JavaScript -- links that get put together on the fly and Google could not possibly know about unless it could execute at least part of that JavaScript code.

Mark Drummond, chief executive of Wowd, a unique search engine company we profiled in the magazine earlier this year, explains in an email why understanding JavaScript is "a very deep, very hard, and very classic computer science problem."

He explains that the challenge is in figuring out whether or not the JavaScript code ever stops running. "The halting problem is undecidable," he writes. He says there is no known algorithm that can be applied to any program, at any point, and tell whether or not that program continues ad infinitum. The fact has been mathematically proven.

Drummond, whose search company avoids some of these complexities by tapping humans to do its indexing, notes that it would be possible to simplify the problem and merely determine, for example, whether or not the Web application has made a data request to Facebook. Presumably, that's what Google is currently doing. [blogs.forbes.com...]

coombesy

1:13 pm on Jun 26, 2010 (gmt 0)

10+ Year Member

From a quality point of view, it makes sense that a search engine should be able to run JavaScript code and check out Ajax calls. I for one support the idea.

From an SEO point of view, does this mean that we should now be pulling our third-party libraries and JS frameworks from Google Code? I for one don't do this, as an issue with Google's servers could then produce display issues on ALL my sites without my knowing, whereas I'd notice an issue with my own server a lot quicker. I hope this doesn't mean that we'll get marked down for keeping third-party scripts in the site's document root.

Alcoholico

3:49 pm on Jun 26, 2010 (gmt 0)

10+ Year Member

I would never ever use Google's CDN; that's like using their spyware, aka Analytics. I have all my JavaScript on a separate server with Googlebot being blocked by robots.txt. For the time being Googlebot has behaved; I hope not to have to cloak all my JavaScript.

Seb7

4:15 pm on Jun 26, 2010 (gmt 0)

10+ Year Member

I don't really understand the point: who would put links in JavaScript for GoogleBot to find? A lot of people put links in JavaScript specifically to avoid being crawled. And would you really want Googlebot to start making requests to your Ajax URLs?

maximillianos

4:30 pm on Jun 26, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

We use JavaScript for our template header. It cuts down on page bloat and helps our content stand out to AdSense bots.

Nice feature that we no longer need to care where our links are.

tedster

6:07 pm on Jun 26, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

> who would put links in JavaScript for GoogleBot to find

As I see it, this is not a common problem for the SMB - but rather for an enterprise site with a large web infrastructure. Many of those have always ASSUMED that Google would find JavaScript links and didn't even notice when it wasn't happening. This is the kind of business that will 404 a page with a million natural backlinks when a new version of the product replaces the old successful one.

Some of these enterprise sites even create their main navigation with JavaScript generated menus that work differently for different areas of the site. This gives them much easier maintenance over a global website with widely distributed "ownership" for the various site segments.

Other enterprises may use very complex AJAX calls. I know of one site where thousands of "product support" videos all live at the same base URL. So by not handling JavaScript in depth, Google had to ignore a good chunk of excellent content.

Even further, not every business even thinks in terms of whether their pages can be crawled, especially businesses that predate the web. And those businesses still thrive without getting search traffic to all their pages, because they are well-branded establishments that have attracted a large and loyal market offline. Google would naturally prefer to access and index all that content whenever they can.

cwnet

8:16 pm on Jun 26, 2010 (gmt 0)

10+ Year Member

Since when is Mark Drummond an official voice for Google?

Very misleading headline for this thread - at best.

With a link to a Forbes blog as an authoritative source?

@ Brett: "For the first time, Google has publically admitted that it has the capability"

Source/Link please!

Sgt_Kickaxe

8:51 pm on Jun 26, 2010 (gmt 0)

cwnet, it's implied, and the evidence is the fact that Googlebot has been making requests that would have required parsing of JavaScript.

The title is off, I don't see Google admitting it publicly even if the bot leaves proof it does.

youfoundjake

9:20 pm on Jun 26, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Google is constantly trying new ideas to improve our coverage of the web. We already do some pretty smart things like scanning JavaScript and Flash to discover links to new web pages,...

[googlewebmastercentral.blogspot.com...]

Of course you will likely have links requiring JavaScript for Ajax functionality, so here's a way to help Ajax and static links coexist:
When creating your links, format them so they'll offer a static link as well as calling a JavaScript function. That way you'll have the Ajax functionality for JavaScript users, while non-JavaScript users can ignore the script and follow the link. For example:

<a href="ajax.htm?foo=32" onClick="navigate('ajax.html#foo=32'); return false">foo 32</a>

[googlewebmastercentral.blogspot.com...]

JavaScript improvements

Google has also been crawling some JavaScript for a while. Primarily, they've been extracting very simply coded links. As of today, they're able to execute JavaScript onClick events. They still recommend using progressive enhancement techniques, however, rather than relying on Googlebot's ability to extract from the JavaScript (not just for search engine purposes, but for accessibility reasons as well).

Googlebot is now able to construct much of the page and can access the onClick event contained in most tags. For now, if the onClick event calls a function that then constructs the URL, Googlebot can only interpret it if the function is part of the page (rather than in an external script).

Some examples of code that Googlebot can now execute include:

<div onclick="document.location.href='http://foo.com/'">
<tr onclick="myfunction('index.html')">
<a href="#" onclick="myfunction()">new page</a>
<a href="javascript:void(0)" onclick="window.open('welcome.html')">open new window</a>

These links pass both anchor text and PageRank.

[searchengineland.com...]

I'd say there's more than enough out there about Google executing data on the fly.

httpwebwitch

1:45 am on Jun 27, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

fwiw, googlebot has been capable of this for many years, and has been doing it for... a while. I'm not sure how long, because I stopped using JavaScript for cloaking around 2006. At PubCon Boston, The Honorable Mister Cutts confirmed this saying they "could" do it, but they "presently did not".

The bot executes inline JavaScript, and it will execute "onclick" events on clickable elements, like <a>. But AFAIK it still won't submit a <form>, and it doesn't trigger onmouseover, onfocus, onblur, etc. Correct me if I'm wrong.

tedster

2:45 am on Jun 27, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Things have moved on from 2006, Mr. Witch. The very same Mr. Cutts confirmed at PubCon Austin that googlebot is submitting forms -- and even establishing "virtual links" in their webgraph to handle PR flow to URLs they discovered this way.

I've since seen this with several clients, whose URLs were getting search traffic when the only way to crawl to those URLs was a form submission/JavaScript combo.

Other members here have also noticed various googlebot "form submission" GETs in their logs, even with keywords selected from around their website being input to a textbox. And sometimes apparent nonsense or even security tests, too.

MichaelBluejay

4:43 am on Jun 27, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

How does it help anyone for Google to follow JavaScript links -- webmasters, web users, or Google itself? I put my ad links in JavaScript specifically so bots don't follow them, so I know how many actual humans are clicking on the ads. If bots start following them, then I've suddenly got a lot more work to do in filtering out those clicks. Fortunately Google hasn't started following my JS'd ad links yet, but if and when they do I'm going to curse them.

UserFriendly

2:31 pm on Jun 27, 2010 (gmt 0)

10+ Year Member

Why would it be a surprise if Google's bot did assemble JavaScript? Google was never a fan of cloaking, and to successfully punish hidden and disguised links, their bot would have to have been able to execute JavaScript for the last ten years at least.

JavaScript is code that you send to the client, so you should never expect it to add a layer of secrecy or security to your site. It should only ever be used for convenience functionality, never form validation or security.

tedster

3:48 pm on Jun 27, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

MichaelBlueJay - if the only barrier you have to crawling those ad links is JavaScript, I think you should add another layer now, rather than wait. Cursing isn't any good after you've already felt the pain.

You can, for instance, call the script from an external file that is disallowed in robots.txt. I'm pretty sure that is what Matt Cutts recommended a year ago or more, when this JavaScript parsing first started up.

MichaelBluejay

11:37 pm on Jun 27, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

How do I call an onclick handler from an external file?

tedster

12:03 am on Jun 28, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

You've still got code in the anchor element that looks something like this: onClick="javascript:functionA('value');"

1. Define functionA and its variables in an external .js file rather than on the page. Since it needs a name, let's call it animals.js

2. Serve the .js from a dedicated directory, let's call it /goodstuff/

3. Disallow that directory in your robots.txt file.

4. Call that script in the <head> with a script element: <script type="text/javascript" src="/goodstuff/animals.js"></script>
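
For anyone following along, here is a rough sketch of what /goodstuff/animals.js might contain once the handler has been moved out of the page. The file and directory names come from the steps above; the ad-click URL pattern and the buildAdUrl helper are made up purely for illustration, and the robots.txt lines appear only as a comment. Treat it as one possible layout, not a recipe.

```javascript
// Hypothetical contents of /goodstuff/animals.js.
//
// robots.txt at the site root would carry (step 3):
//   User-agent: *
//   Disallow: /goodstuff/

// Assumed helper: builds the tracked ad URL at click time, so the
// address never appears in the page's own HTML.
function buildAdUrl(value) {
  return "/ads/click?id=" + encodeURIComponent(value);
}

// functionA is the handler named in the anchor's onClick attribute.
function functionA(value) {
  var url = buildAdUrl(value);
  // Only navigate when running in a browser.
  if (typeof window !== "undefined") {
    window.location.href = url;
  }
  return url;
}
```

Whether a given crawler honors the Disallow line is a separate question; this only shows where the code would live.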

Brett_Tabke

12:33 am on Jun 28, 2010 (gmt 0)

WebmasterWorld Administrator

10+ Year Member

Top Contributors Of The Month

Best Post Of The Month

> their bot would have to have been able to
> execute JavaScript for the last ten years at least.

They have not been able to. Random dynamically built URLs via JavaScript were not downloaded by GoogleBot until Q4 last year.

> The Honorable Mister Cutts confirmed this saying they
> "could" do it, but they "presently did not".

No no no. He said they could "scan" or "parse" the text, but not execute it. Looking for http: in the code is easy - executing complex JS (often obfuscated to death, like Facebook's) is something else entirely.

> I'd say there's more then enough out there about
> google executing data on the fly.

No there is not - there is nothing out there about them executing code until g/io. Nothing in their blog posts, or from SEOs anywhere on the net that I can see. Again, scanning JS for http or www is child's play - executing that code is entirely different. Looked at some of Facebook's code lately? GoogleBot is now executing it. That started Q4 last year just before Pubcon.
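
To make the scan-versus-execute distinction concrete, here is a small sketch. The page script, the example.com URL and both function names are invented for illustration, and nothing here is a claim about how GoogleBot is actually built. The point is only that a text scan finds nothing when the URL is assembled at run time, while running the code recovers it.

```javascript
// Hypothetical page script: the URL is assembled at run time, so the
// literal address never appears in the source text.
const pageScript = `
  var host = ["www", "example", "com"].join(".");
  var path = "/products/" + (40 + 2) + ".html";
  var builtUrl = "http" + "://" + host + path;
`;

// Approach 1: scan the source text for anything that looks like a URL.
function scanForUrls(source) {
  return source.match(/https?:\/\/[^\s"'<>]+/g) || [];
}

// Approach 2: actually run the script and read back the variable it sets.
// new Function keeps the sketch short; a real crawler would need a
// sandboxed engine with a DOM, timeouts and so on.
function executeForUrl(source) {
  return new Function(source + "\nreturn builtUrl;")();
}

const scanned = scanForUrls(pageScript);    // empty: no literal URL in the text
const executed = executeForUrl(pageScript); // the URL the script builds
console.log(scanned, executed);
```

The scan comes back empty because "http" and "://" are never adjacent in the source; only execution yields http://www.example.com/products/42.html.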

Mark_A

9:29 am on Jun 28, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Google is always expanding what it can read / access.

I can remember posting my CV online in a PDF document because I thought the only ones who could read it would be humans. Soon enough it started showing up in the SERPs, much to my embarrassment.

rustybrick

1:01 pm on Jun 28, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Yea, I'm with youfoundjake...

I believe they admitted this time and again over the course of the year.

Silvery

3:24 pm on Jun 28, 2010 (gmt 0)

10+ Year Member

tedster is right - quite a few larger sites used JavaScript links to keep bots out -- much as submission forms were commonly considered a barrier some time ago. It's now quite difficult to keep any page /out/ of Google's index.

youfoundjake and rustybrick are right -- Google's been stating that they interpret a lot of JavaScript for quite some time now.

MichaelBluejay asked how it helps anyone -- I think Google would say that it helps them discover many more links/pages that unknowing webmasters unintentionally hid from bots. They'd also say that it's necessary in order for them to discover various cloaking-style exploits and other spammer hacks, and so that they can suppress/warn users of malicious content.

Brett_Tabke makes a good point that there is indeed a difference between parsing JavaScript and executing it, but I think much of this argument is semantic, because Google could do, and likely was doing, a great deal more interpretation of JavaScript code than just discovering URLs. True, there's complex code that's difficult to interpret/diagnose, but it's also true that a great amount of the JavaScript in use is fairly common, repetitive and easy to interpret.

There is a difference between actually executing code as a user agent and merely interpreting what code is doing in a predictive modelling fashion. The difference is primarily in the potential impact that executing code would have upon a live application. Otherwise, for simply knowing what an application does and how it affects user interactions with a site, the difference is primarily semantic -- Google's use of that info would be the same regardless of the method used.

enigma1

8:37 am on Jun 29, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

And what is stopping the server code from not sending the client side scripts to spiders/bots requests? So if you don't want the bots to see js you don't emit it. It's easy enough to identify.

Also with many ajax frameworks the client side scripts are processed on some handler action (ready, click etc) and no active scripting is visible to the HTML page for a spider to follow. It's how it should be so the same HTML content is always used.

MichaelBluejay

10:30 am on Jun 29, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

> And what is stopping the server code from not sending the client side scripts to spiders/bots requests?

Maybe that the pages are static and not dynamically generated by scripts/databases? And maybe that the webmasters *do* want bots like Googlebot to see the site the way that users do, but just not follow the onclicks in the <a href>'s?

ppc_newbie

3:24 pm on Jun 29, 2010 (gmt 0)

10+ Year Member

If they have a browser that can execute JS code, they can have the bot execute JS code.

What they can discover by blackbox testing the code with clicks and queries remains unknown to us mortals.

enigma1

3:41 pm on Jun 29, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

> webmasters *do* want bots like Googlebot to see the site the way that users do

which is?

Unless you know what browser each and every visitor uses on your site, and how it is configured at any given time, your best bet is to stay compatible with plain HTML. To give you a simple example:

<!-- HTML header section -->
<script type="text/javascript">
function popupWindow(url) {
  // JavaScript code to pop up a window; a spider may or may not decode it correctly
  window.open(url);
}
</script>

<!-- Body -->
<a href="#" rel="nofollow" onclick="popupWindow('http://www.example.com/popup_page.html')">click to popup</a>

No: this HTML only works when JS is enabled, and spiders can misinterpret the code. Not good.

Here is what can work for both.
<!-- HTML header section -->
<script type="text/javascript" src="some_ajax_framework.js"></script>

<!-- Body -->
<a href="http://www.example.com/popup_page.html" id="click_id_for_ajax">click to popup</a>

This works in both cases, whether JS is on or off, because the click handler is attached externally if JS is enabled. You can initiate the functions for this during "ready", or after the HTML portion is loaded, with a script line. So you avoid cases where a spider misinterprets JS code. You also support all users, with or without active scripting.
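
As a rough illustration of the external handler, the sketch below shows what a file like some_ajax_framework.js might do for the link with id click_id_for_ajax. The function name enhancePopupLink and the injected openWindow callback are assumptions made for this example (the callback is there so the logic can be exercised outside a browser); a real framework would have its own API.

```javascript
// Attaches a popup handler to a link that already has a real href.
// openWindow is passed in so the logic does not depend on a browser.
function enhancePopupLink(link, openWindow) {
  link.addEventListener("click", function (event) {
    event.preventDefault();   // cancel the normal navigation
    openWindow(link.href);    // open the same URL in a popup instead
  });
}

// In a browser, wire it up once the HTML portion has loaded.
if (typeof document !== "undefined") {
  document.addEventListener("DOMContentLoaded", function () {
    var link = document.getElementById("click_id_for_ajax");
    if (link) {
      enhancePopupLink(link, function (url) {
        window.open(url);
      });
    }
  });
}
```

With scripting off, or for a spider that ignores the script, the anchor is just an ordinary link to popup_page.html, which is the whole point.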

And it is irrelevant if you have dynamically generated pages.

Propools

9:23 pm on Jun 29, 2010 (gmt 0)

10+ Year Member

> I have all my javascript on a separate server with googlebot being blocked by robots.txt. For the time being googlebot has behaved

I wish googlebot would respect robots.txt, as they currently don't with ours.

MichaelBluejay

2:00 am on Jun 30, 2010 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Top Contributors Of The Month

Propools, are you sure it's really Googlebot, and not a rogue bot masquerading as Googlebot? (i.e., have you checked the requesting IP?)
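
One commonly recommended way to check the requesting IP is a reverse DNS lookup followed by a forward lookup: the IP should resolve to a name under googlebot.com or google.com, and that name should resolve back to the same IP. A minimal sketch follows. The function names are made up for this example, and the resolver argument is assumed to behave like Node's dns.promises (with reverse and lookup methods).

```javascript
// Pure check: does a reverse-DNS name sit under Google's crawler domains?
function isGoogleCrawlerHostname(hostname) {
  return /\.(googlebot|google)\.com\.?$/i.test(hostname);
}

// resolver is assumed to look like Node's dns.promises:
//   reverse(ip) -> Promise of an array of hostnames
//   lookup(hostname, { all: true }) -> Promise of [{ address, family }]
async function isRealGooglebot(ip, resolver) {
  try {
    const hostnames = await resolver.reverse(ip);
    for (const hostname of hostnames) {
      if (!isGoogleCrawlerHostname(hostname)) continue;
      // Forward-confirm: the name must resolve back to the same IP.
      const addresses = await resolver.lookup(hostname, { all: true });
      if (addresses.some((entry) => entry.address === ip)) return true;
    }
  } catch (err) {
    // No PTR record or a failed lookup: treat as unverified.
  }
  return false;
}
```

In Node this could be called as isRealGooglebot(ip, require("node:dns").promises) for each suspicious IP pulled from the logs. A user-agent string alone proves nothing, since anyone can send it.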