
Do you wonder to what extent the Googlebot executes JavaScript?

         

analognico

10:35 pm on Jan 18, 2015 (gmt 0)

10+ Year Member



The Googlebot's ability to execute JavaScript is usually perceived as very limited. However, if you do it right even a fully fledged Single Page Application can be crawled.

Roughly speaking, the JavaScript code is executed if:
  • it is executed before or on the window.load event
  • and no AJAX calls are involved.
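
The two conditions above can be illustrated with a toy model. This is only a sketch of the reported behaviour, not Googlebot's actual implementation: createPage, renderOnLoad, renderAfterTimeout, and snapshotAtLoad are hypothetical names, and a plain object stands in for the DOM so that the example also runs outside a browser.

```javascript
// Minimal stand-in for a page; a real page would use the DOM instead.
function createPage() {
  return { content: [], loadHandlers: [] };
}

// Variant 1: content is added by a handler that runs on the load event.
function renderOnLoad(page) {
  page.loadHandlers.push(() => page.content.push("rendered on load"));
}

// Variant 2: content is added only after a timer fires, i.e. after load.
function renderAfterTimeout(page, delayMs) {
  page.loadHandlers.push(() => {
    setTimeout(() => page.content.push("rendered after timeout"), delayMs);
  });
}

// Models a crawler that fires the load handlers and then reads the page
// immediately, without waiting for timers (an assumption based on the
// behaviour described above).
function snapshotAtLoad(page) {
  page.loadHandlers.forEach((handler) => handler());
  return page.content.slice(); // copy, so later timer callbacks do not change it
}

const page = createPage();
renderOnLoad(page);
renderAfterTimeout(page, 10);
const snapshot = snapshotAtLoad(page);
console.log(snapshot);
```

In this model the snapshot contains only the entry added on load; the entry added by the timer arrives too late to be seen.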


Since you discussed this topic on this forum before ([webmasterworld.com ]), I thought you might be interested to read the details in my blog post: [bit.ly ].

aakk9999

2:39 pm on Jan 19, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hello analognico and welcome to WebmasterWorld!

You have published very interesting test results on Google's JavaScript execution.

We normally do not allow links to personal blogs, but in this case we are making an exception: the information you have published is not promotional and not available elsewhere, and without your blog article the discussion in this thread of what you found when testing Google's JavaScript execution would not be possible.

From your blog post:
The Googlebot does NOT execute JavaScript if either:

  • The code is executed after an AJAX call returns or
  • The code is executed after a timeout.


This is interesting information. I am wondering whether this means that using one of the above techniques for a tabbed or "Read more" interface on the page could get around the issue of Google not indexing the content of non-visible tabs or the text "hidden" behind "Read more"?

    analognico

    3:41 pm on Jan 19, 2015 (gmt 0)

    10+ Year Member



    Hi aakk9999, thanks for making an exception!

    Regarding your question about tabbed pages or hidden text behind "Read more":

    Usually such functionality is implemented by some JavaScript that inserts the content on a click event. Unfortunately, the Googlebot doesn't do clicks. It just loads pages, extracts the links, and follows the links by loading those pages in separate requests.

    A common technique to implement tabs or a read more link is to use respective routes. I.e. for a tabbed page with tabs "A", "B", and "C" you would use the routes http://mysite.com/mypage#!A, http://mysite.com/mypage#!B, and http://mysite.com/mypage#!C. Now, the tabs themselves don't have an onClick event anymore but instead contain a real <a href="...">.

    If the page is served to a human it will work like this:
    1. The human opens http://mysite.com/mypage.
    2. The human clicks on tab A; to be precise on <a href="http://mysite.com/mypage#!A">...</a>.
    3. The browser will change the URL hash to #!A according to the anchor the human clicked on.
    4. A routing JS library recognizes the hash change.
    5. The routing lib invokes the JS code that activates tab A according to the hash. (BTW, it invokes the exact same code that otherwise was bound to the onClick of the tab.)
    6. The human can now see tab A and its content.


    If the page is served to the Googlebot - serving up the identical sources as to a human - it will work like this:

    1. The Googlebot opens http://mysite.com/mypage.
    2. The Googlebot crawls the page for anchors and finds those for the tabs.
    3. The Googlebot repeats steps 4 to 7 for each tab link:
    4. The Googlebot makes a separate request to e.g. http://mysite.com/mypage#!A.
    5. While the page is loaded, the routing JS library is initialized and recognizes the hash for tab A.
    6. The routing lib immediately invokes the code that activates tab A according to the hash.
    7. The Googlebot can now crawl tab A and its content.


    BTW, this procedure is exactly the same as if a human would directly open http://mysite.com/mypage#!A in a browser. So this code is not just implemented for the sake of crawling.
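
A minimal sketch of such a router, assuming hashbang URLs as in the example. The names tabs, tabFromHash, activateTab, and route are hypothetical and not taken from any routing library, and a real page would toggle DOM elements rather than return strings. The point is that the same route function serves both a hash change after a click and a direct load of a tab URL.

```javascript
// Hypothetical tab contents; a real page would show or hide DOM elements.
const tabs = {
  A: "Content of tab A",
  B: "Content of tab B",
  C: "Content of tab C",
};

// Extracts the tab name from a hashbang fragment such as "#!A".
// Returns null when the fragment is not a hashbang or names an unknown tab.
function tabFromHash(hash) {
  if (!hash.startsWith("#!")) return null;
  const name = hash.slice(2);
  return Object.prototype.hasOwnProperty.call(tabs, name) ? name : null;
}

// Activates a tab; the same function a click handler would have called.
function activateTab(name) {
  return tabs[name];
}

// Router entry point: used on page load and on every hash change.
function route(hash, defaultTab = "A") {
  return activateTab(tabFromHash(hash) || defaultTab);
}

// Browser wiring; skipped when no window object exists (e.g. under Node).
if (typeof window !== "undefined") {
  const show = () => console.log(route(window.location.hash));
  window.addEventListener("hashchange", show); // a tab link was clicked
  show(); // the page was opened directly, e.g. at /mypage#!B
}
```

Because the router runs once during page load, opening a tab URL directly needs no click, which is what the crawling scenario above relies on.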

    In my example I used hashbangs. Nowadays you would usually use pushState by default (i.e. http://mysite.com/mypage/A) and use a routing lib that falls back to hashes if the browser doesn't support pushState. By design all crawlers support pushState.
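
As a rough sketch of that fallback, assuming the URL shapes from the example: tabUrl and tabFromUrl are hypothetical helpers, not part of any routing library, and the feature test simply checks whether history.pushState exists.

```javascript
// Builds the URL for a tab: a real path when pushState is available,
// otherwise a hashbang on the page URL.
function tabUrl(pageUrl, tab, hasPushState) {
  return hasPushState ? `${pageUrl}/${tab}` : `${pageUrl}#!${tab}`;
}

// Reads the tab name back from either URL form.
// Returns null for the bare page URL.
function tabFromUrl(url, pagePath) {
  const { pathname, hash } = new URL(url);
  if (hash.startsWith("#!")) return hash.slice(2);
  if (pathname.startsWith(pagePath + "/")) {
    return pathname.slice(pagePath.length + 1);
  }
  return null;
}

// Feature test; evaluates to false where no history object exists.
const hasPushState =
  typeof history !== "undefined" && typeof history.pushState === "function";

console.log(tabUrl("http://mysite.com/mypage", "A", hasPushState));
```

With path-style URLs the server must also answer requests for /mypage/A with the same page, since that path is sent to the server, unlike a hash fragment.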

    aakk9999

    10:53 pm on Jan 19, 2015 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    Thanks for these details.

    The Googlebot can now crawl tab A and its content.

    The problem is that, unless tab A is the one that is initially "open" (selected) when the page loads, Google would not attribute the crawled content of tab A to /mypage.

    My understanding is that in that case the content would be attributed to a new URL /mypage#!A whereas /mypage would have the content from the initially opened tab only.

    I am looking into something directly opposite. What I am interested in is whether this new knowledge can help with the situation discussed in this thread:

    Google Not Displaying Hidden Content - Solution ?
    http://www.webmasterworld.com/google/4722381.htm

    The tabbed content for all tabs is in the HTML document all along, but Google can detect that it needs a user action to reveal the other tabs and therefore is not attributing the content of these inactive tabs to the page, even though it can see the content in the page HTML.

    Think "Read more" sliders or a tabbed interface where content that already exists in the HTML is organised into tabs and revealed on click. If the "tab" code can be brought to the page using AJAX, then Google would not know that there is a piece of code that in essence organises the text into tabs; hence it would index all content that is on the page rather than deciding not to index the content just because it is in a tab that is revealed on click.

    I know that a tab interface could be using AJAX exactly as you described and that each tab could get a hash-bang URL, but this would mean that the content is attributed to the URL so formed, which has only tab A's content, rather than to the page that has the complete content.

    Or in other words - if I have a page with 4 tabs and each tab has some content, and the page used CSS/JavaScript (not AJAX) to organise the content into tabs, then Google used to index the page and include all of the content (all four tabs).

    This has recently changed, so Google will now index only the text that is in the active tab when the page opens. Anything that opens "on click" or through some other user action would not be attributed to the same page.

    To index the content in the other tabs, a webmaster would need to use AJAX as you described above and create a hash-bang URL for each of the tabs. This would however result in 4 URLs rather than one stronger URL.

    Any ideas?

    analognico

    11:56 pm on Jan 19, 2015 (gmt 0)

    10+ Year Member



    Very good point. Actually, my perspective usually is to invest as little effort as possible to make my pages accessible to Google. However, I understand your point that you want to have a single URL which contains the content of all tabs and thus hopefully ranks very high in the search results.

    IMO the best way to go would be like so:
    • The website remains JavaScript-driven as discussed above. You may just use hash links for the tabs instead of hashbangs. Google doesn't follow simple hashes.
    • In the head of your index.html you add <meta name="fragment" content="!">.
    • The JavaScript code is extended as described below.


    The Googlebot will load the page like this:
    1. The Googlebot requests http://mysite.com/mypage.
    2. The Googlebot sees the meta tag and cancels crawling this page.
    3. The Googlebot requests the page again, this time with a search query: http://mysite.com/mypage?_escaped_fragment_=
    4. The routing library of the page recognizes the search query and invokes some special code that renders the page with the contents of all tabs.
    5. The Googlebot can now crawl everything.
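
Step 4 could look roughly like the following. This is a sketch under assumptions, not the code from the blog post: tabContents, isCrawlerRequest, and renderPage are hypothetical names, and rendering is reduced to returning text. Only the _escaped_fragment_ parameter name comes from the procedure above.

```javascript
// Hypothetical contents of a four-tab page.
const tabContents = {
  A: "Text of tab A",
  B: "Text of tab B",
  C: "Text of tab C",
  D: "Text of tab D",
};

// True when the URL carries the _escaped_fragment_ parameter, even with an
// empty value as in http://mysite.com/mypage?_escaped_fragment_=
function isCrawlerRequest(url) {
  return new URL(url).searchParams.has("_escaped_fragment_");
}

// Renders every tab for the crawler request, otherwise only the active tab.
function renderPage(url, activeTab = "A") {
  if (isCrawlerRequest(url)) {
    return Object.values(tabContents).join("\n");
  }
  return tabContents[activeTab];
}

console.log(renderPage("http://mysite.com/mypage?_escaped_fragment_="));
```

In a browser the router would pass window.location.href to renderPage during page load, so that the full content is already in the document when the load event fires.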


    This approach follows Google's AJAX Crawling spec (https://developers.google.com/webmasters/ajax-crawling/). But instead of serving HTML snapshots the original JavaScript-driven page is served to the crawler. This is where the "precomposing" of the SPA - as I describe in my blog article - comes in.

    ergophobe

    3:21 am on Jan 22, 2015 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



    Had to read this and your blog post a couple of times before I wanted to reply.

    I'm still wrapping my head around #! to _escaped_fragment mapping and all that.

    Just as a point of clarification, though, I believe that your tests show that the jQuery document.ready() test would be crawlable (being essentially the DOMContentLoaded event), right?

    analognico

    3:32 am on Jan 22, 2015 (gmt 0)

    10+ Year Member



    Correct, jQuery's $(document).ready(...) either hooks onto the DOMContentLoaded event or falls back to window.onload. In both cases the Googlebot will execute the handler before it processes the HTML for indexing.
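
For illustration, a much simplified "ready" helper in the same spirit. This is not jQuery's source: onReady is a hypothetical function, and the document and window objects are passed in as parameters so that the sketch can also be exercised outside a browser.

```javascript
// Runs handler once the DOM is ready.
// doc and win are document-like and window-like objects.
function onReady(doc, win, handler) {
  if (doc.readyState !== "loading") {
    handler(); // DOM is already parsed, run immediately
  } else if (typeof doc.addEventListener === "function") {
    doc.addEventListener("DOMContentLoaded", handler);
  } else {
    win.onload = handler; // fallback for environments without addEventListener
  }
}

// Browser usage; skipped when no document object exists.
if (typeof document !== "undefined" && typeof window !== "undefined") {
  onReady(document, window, () => console.log("DOM ready"));
}
```

In each branch the handler runs no later than the load event, which matches the condition for execution stated at the top of the thread.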