New: Content analysis and Sitemap details, plus more languages

Posted by Jonathan Simon, Webmaster Trends Analyst - December 13, 2007 on 3:48 pm | In Google Web Central, crawling and indexing, sitemaps, webmaster tools

We're always striving to help webmasters build outstanding websites, and in our latest release we have two new features: Content analysis and Sitemap details. We hope these features help you to build a site you could compare to a fine wine -- getting better and better over time.

Content analysis

To help you improve the quality of your site, our new content analysis feature should be a helpful addition to the crawl error diagnostics already provided in Webmaster Tools. Content analysis contains feedback about issues that may impact the user experience or that may make it difficult for Google to crawl and index pages on your site. By reviewing the areas we've highlighted, you can help eliminate potential issues that could affect your site's ability to be crawled and indexed. This results in better indexing of your site by Google and other search engines.

The Content analysis summary page within the Diagnostics section of Webmaster Tools features three main categories. Click on a particular issue type for more details:

  • Title tag issues
  • Meta description issues
  • Non-indexable content issues

[Screenshot: content analysis usability section]

Selecting "Duplicate title tags" displays a list of repeated page titles along with a count of how many pages contain that title. We currently present up to thirty duplicated page titles on the details page. If the duplicate title issues shown are corrected, we'll update the list to reflect any other pages that share duplicate titles the next time your website is crawled.

Also, in the Title tag issues category, we show "Long title tags" and "Short title tags." For these issue types we will identify title tags that are way too short (for example "IT" isn't generally a good title tag) or way too long (title tag was never intended to mean <insert epic novel here>). A similar algorithm identifies potentially problematic meta description tags. While these pointers won't directly help you rank better (i.e. pages with <title> length x aren't moved to the top of the search results), they may help your site display better titles and snippets in search results, and this can increase visitor traffic.
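
If you'd like to run a rough version of these title checks on your own pages before the next crawl, here is a minimal Python sketch. It is not how Webmaster Tools works internally: the `analyze_titles` helper, the sample pages, and both length thresholds are assumptions made up for illustration, since this post doesn't state the limits we use.

```python
from collections import defaultdict

# Hypothetical thresholds; the post does not say which limits Webmaster Tools applies.
MIN_TITLE_CHARS = 5
MAX_TITLE_CHARS = 70


def analyze_titles(pages):
    """pages maps URL -> title text. Returns duplicate, short, and long title findings."""
    urls_by_title = defaultdict(list)
    for url, title in pages.items():
        urls_by_title[title.strip()].append(url)

    # A title is a duplicate when more than one URL uses it.
    duplicates = {t: sorted(urls) for t, urls in urls_by_title.items() if len(urls) > 1}
    short = sorted(u for u, t in pages.items() if len(t.strip()) < MIN_TITLE_CHARS)
    long_ = sorted(u for u, t in pages.items() if len(t.strip()) > MAX_TITLE_CHARS)
    return {"duplicates": duplicates, "short": short, "long": long_}


# Made-up sample data.
pages = {
    "/": "IT",
    "/candy": "Gummy candy catalog",
    "/candy?page=2": "Gummy candy catalog",
    "/about": "About our shop " + "and its very long history " * 4,
}
report = analyze_titles(pages)
print(report["duplicates"])
print(report["short"], report["long"])
```

The same grouping approach could be pointed at meta descriptions by passing description text instead of titles.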

In the "Non-indexable content issues" category, we give you a heads-up about areas that aren't as friendly to our more text-based crawler. Be sure to check out our posts on Flash and images to learn how to make these items more search-engine friendly.


[Screenshot: content analysis crawlability section]


Sitemap details page

If you've submitted a Sitemap, you'll be happy to see the additional information in Webmaster Tools revealing how your Sitemap was processed. You can find this information on the newly available Sitemap Details page which (along with information that was previously provided for each of your Sitemaps) shows you the number of pages from your Sitemap that were indexed. Keep in mind the number of pages indexed from your Sitemap may not be 100% accurate because the indexed number is updated periodically, but it's more accurate than running a "site:example.com" query on Google.

The new Sitemap Details page also lists any errors or warnings that were encountered when specific pages from your Sitemap were crawled. So the time you might have previously spent crafting custom Google queries to determine how many pages from your Sitemap were indexed can now be spent on improving your site. If your site is already the crème de la crème, you might prefer to spend the extra free time mastering your ice-carving skills or blending the perfect eggnog.

Here's a view of the new Sitemap details page:


Sitemaps are an excellent way to tell Google about your site's most important pages, especially if you have new or updated content that we may not know about. If you haven't yet submitted a Sitemap or have questions about the process, visit our Webmaster Help Center to learn more.

Webmaster Tools now available in Czech & Hungarian

We love expanding our product to help more people in their language of choice. We recently added Czech and Hungarian to Webmaster Tools, in addition to the 20 other languages we already support. We won't be stopping here: if your language of choice isn't currently supported, stay tuned -- there are more languages to come.

We always love to hear what you think. Please visit our Webmaster Help Group to share comments or ask questions.

 



Dealing with Sitemap cross-submissions

Posted by Mickey Kataria, Google Zürich - October 25, 2007 on 11:57 am | In Google Web Central, sitemaps, webmaster tools

Since the launch of Sitemaps, webmasters have been asking if they could submit their Sitemaps for multiple hosts on a single dedicated host. A fair question -- and now you can!

Why would someone want to do this? Let's say that you own www.example.com and mysite.google.com and you have Sitemaps for both hosts, e.g. sitemap-example.xml and sitemap-mysite.xml. Until today, you would have to store each Sitemap on its respective host. If you tried to place sitemap-mysite.xml on www.example.com, you would get an error because, for security reasons, a Sitemap on www.example.com can only contain URLs from www.example.com. So how do we solve this? Well, if you can "prove" that you own or control both of these hosts, then either one can host a Sitemap containing URLs for the other. Just follow the normal verification process in Google Webmaster Tools and any verified site in your account will be able to host Sitemaps for any other verified site in the same account.
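
To make the rule concrete, here is a small Python sketch of the host check described above. It is only an illustration of the idea, not our actual validation code: the `rejected_urls` function, the sample Sitemap, and the way verified hosts are passed in as a set are all assumptions.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def rejected_urls(sitemap_xml, hosting_host, verified_hosts):
    """Return the URLs in the Sitemap that the hosting host may not list."""
    # A host may always list its own URLs. URLs for other hosts are accepted
    # only when the hosting host is in the same verified set (assumed model
    # of "verified sites in the same account").
    allowed = {hosting_host}
    if hosting_host in verified_hosts:
        allowed |= set(verified_hosts)
    root = ET.fromstring(sitemap_xml)
    locs = [el.text.strip() for el in root.iter(SITEMAP_NS + "loc")]
    return [u for u in locs if urlsplit(u).hostname not in allowed]


# Hypothetical contents of sitemap-mysite.xml, stored on www.example.com.
sitemap_mysite = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://mysite.google.com/page.html</loc></url>
</urlset>"""

# Before verification: the cross-host URL is rejected.
print(rejected_urls(sitemap_mysite, "www.example.com", set()))
# After both hosts are verified: nothing is rejected.
print(rejected_urls(sitemap_mysite, "www.example.com",
                    {"www.example.com", "mysite.google.com"}))
```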

Here is an example showing both sites verified:

And now, from a single host, you can submit Sitemaps for both sites without any errors. sitemap-example.xml contains URLs from www.example.com and sitemap-mysite.xml contains URLs from mysite.google.com but both now reside on www.example.com:

We've also added more information on handling cross-submits in our Webmaster Help Center.

For those of you wondering how this affects the other search engines that support the Sitemap Protocol, rest assured that we're talking to them about how to make cross-submissions work seamlessly across all of them. Until then, this specific solution will work only for users of Google Webmaster Tools.

 



Introducing Code Search Sitemaps

Posted by Mickey Kataria, Google Zürich - October 18, 2007 on 2:59 pm | In Google Web Central, sitemaps, webmaster tools

The Sitemaps team is continuing its trend of extending the Sitemap Protocol for specific products and content types. Our latest work with the Google Code Search team now enables you to create Sitemaps that contain information about public source code you host and would like to include in Code Search. There's more information about this new functionality on the Google Code blog. If you're eager to get going, take a look at our Help Center documentation, create a Code Search Sitemap, sign into Google Webmaster Tools, and submit a Sitemap for Code Search!

 



Google, duplicate content caused by URL parameters, and you

Posted by Maile Ohye - September 12, 2007 on 3:13 am | In Google Web Central, crawling and indexing, sitemaps

How can URL parameters, like session IDs or tracking IDs, cause duplicate content?
When user and/or tracking information is stored through URL parameters, duplicate content can arise because the same page is accessible through numerous URLs. It's what Adam Lasnik referred to in "Deftly Dealing with Duplicate Content" as "store items shown (and -- worse yet -- linked) via multiple distinct URLs." In the example below, URL parameters create three URLs which access the same product page.

[Diagram: three URLs with different parameters that access the same product page]

Why should you care?
When search engines crawl identical content through varied URLs, there may be several negative effects:

1. Having multiple URLs can dilute link popularity. For example, in the diagram above, rather than 50 links to your intended display URL, the 50 links may be divided three ways among the three distinct URLs.

2. Search results may display user-unfriendly URLs (long URLs with tracking IDs, session IDs), which:
* Decreases the chances of a user selecting the listing
* Offsets branding efforts


How we help users and webmasters with duplicate content
We've designed algorithms to help prevent duplicate content from negatively affecting webmasters and the user experience.

1. When we detect duplicate content, such as through variations caused by URL parameters, we group the duplicate URLs into one cluster.

2. We select what we think is the "best" URL to represent the cluster in search results.

3. We then consolidate properties of the URLs in the cluster, such as link popularity, to the representative URL.

Consolidating properties from duplicates into one representative URL often provides users with more accurate search results.
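
The three steps above can be pictured with a toy Python sketch. To be clear, this is not our algorithm: the list of ignored parameters, the "shortest URL wins" rule for picking a representative, and the sample link counts are all assumptions chosen to keep the example short.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumed set of parameters that do not change the page content.
IGNORED_PARAMS = {"affiliateid", "trackingid", "sessionid"}


def cluster_key(url):
    """Step 1: URLs that differ only in ignored parameters share one key."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k.lower() not in IGNORED_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path,
                       urlencode(kept), ""))


def consolidate(link_counts):
    """link_counts maps URL -> inbound links. Returns {representative: total links}."""
    clusters = defaultdict(list)
    for url in link_counts:
        clusters[cluster_key(url)].append(url)
    result = {}
    for urls in clusters.values():
        # Step 2 (assumed heuristic): the shortest URL represents the cluster.
        representative = min(urls, key=lambda u: (len(u), u))
        # Step 3: add up the link counts of every URL in the cluster.
        result[representative] = sum(link_counts[u] for u in urls)
    return result


# Made-up example: 50 links split across three URLs for one product page.
links = {
    "http://www.example.com/product.php?item=swedish-fish": 20,
    "http://www.example.com/product.php?item=swedish-fish&trackingid=1234": 15,
    "http://www.example.com/product.php?item=swedish-fish&sessionid=5678": 15,
}
print(consolidate(links))
```

In this sample the three URLs collapse into one cluster, and the representative URL ends up with all 50 links rather than a third of them.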


If you find you have duplicate content as mentioned above, can you help search engines understand your site?
First, no worries: there are many sites on the web that use URL parameters, and for valid reasons. But yes, you can help reduce potential problems for search engines by:

1. Removing unnecessary URL parameters -- keep the URL as clean as possible.

2. Submitting a Sitemap with the canonical (i.e. representative) version of each URL. While we can't guarantee that our algorithms will display the Sitemap's URL in search results, it's helpful to indicate the canonical preference.
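
As a rough illustration of the second suggestion, this Python sketch writes a minimal Sitemap that lists each canonical URL once. The `build_sitemap` helper and the sample URL are hypothetical; only the `urlset`, `url`, and `loc` elements and the namespace come from the Sitemap Protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(canonical_urls):
    """Return Sitemap XML listing each canonical URL exactly once."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in sorted(set(canonical_urls)):
        # ElementTree escapes characters such as "&" in the URL text.
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical input: the same canonical URL collected twice.
sitemap = build_sitemap([
    "http://www.example.com/product.php?item=swedish-fish",
    "http://www.example.com/product.php?item=swedish-fish",
])
print(sitemap)
```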


How can you design your site to reduce duplicate content?
Because of the way Google handles duplicate content, webmasters need not be overly concerned with the loss of link popularity or loss of PageRank due to duplication. However, to reduce duplicate content more broadly, we suggest:

1. When tracking visitor information, use 301 redirects to redirect URLs with parameters such as affiliateID, trackingID, etc. to the canonical version.

2. Use a cookie to set the affiliateID and trackingID values.

If you follow this guideline, your webserver logs could appear as:

127.0.0.1 - - [19/Jun/2007:14:40:45 -0700] "GET /product.php?category=gummy-candy&item=swedish-fish&affiliateid=ABCD HTTP/1.1" 301 -

127.0.0.1 - - [19/Jun/2007:14:40:45 -0700] "GET /product.php?item=swedish-fish HTTP/1.1" 200 74

And the session file storing the raw cookie information may look like:

category|s:11:"gummy-candy";affiliateid|s:4:"ABCD";

Please be aware that if your site uses cookies, your content (such as product pages) should remain accessible with cookies disabled.
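
For readers who want to see the redirect-plus-cookie idea in code, here is a minimal Python sketch that mirrors the log lines above. It is a simplification under stated assumptions: the `handle_request` function and the `COOKIE_PARAMS` set are hypothetical, and the values are written straight into cookies, whereas the sample above keeps them in a server-side session file.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Assumed parameters to move out of the URL, matching the sample log lines.
COOKIE_PARAMS = {"affiliateid", "trackingid", "category"}


def handle_request(request_target):
    """Return (status, headers) for a request target like '/product.php?item=x'."""
    parts = urlsplit(request_target)
    params = parse_qsl(parts.query)
    moved = [(k, v) for k, v in params if k.lower() in COOKIE_PARAMS]
    if not moved:
        # Canonical URL: serve the page whether or not cookies are present.
        return 200, []
    kept = [(k, v) for k, v in params if k.lower() not in COOKIE_PARAMS]
    location = parts.path + ("?" + urlencode(kept) if kept else "")
    # 301 to the canonical URL and carry the tracking values in cookies.
    headers = [("Location", location)]
    headers += [("Set-Cookie", f"{k}={v}; Path=/") for k, v in moved]
    return 301, headers


status, headers = handle_request(
    "/product.php?category=gummy-candy&item=swedish-fish&affiliateid=ABCD")
print(status, headers)
print(handle_request("/product.php?item=swedish-fish"))
```

Because the canonical URL returns 200 without looking at any cookie, the product page stays reachable for visitors and crawlers that have cookies disabled.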


How can we better assist you in the future?
We recently published ideas from SMX Advanced on how search engines can help webmasters with duplicate content. If you have an opinion on the topic, please join our conversation in the Webmaster Help Group (we've already started the thread).