How does a search engine spider work? I never knew the details, but we can certainly figure out how it crawls our site, e.g. by looking through Webmaster Tools. Setting a standard (preferred) domain name is important if we care about PageRank. Obviously a human visitor can still reach a site whether he types www or not. But for the spider it's totally different. The spider can treat mysite.com and www.mysite.com as two different sites, and of course as two different pages. The same goes for pages like mysite.com/index.html and www.mysite.com/index.html. If these two URLs were the same to the spider bot, I think we wouldn't need to verify the site's ownership twice in Google Webmaster Tools, would we? We verify once for the www version and once for the non-www version, and then we are able to set the canonical URL for the domain/homepage.
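To make the idea of one preferred form concrete, here is a minimal Python sketch that maps the www/non-www and index-file variants onto a single URL. The choice of www as the preferred host, the `http` scheme, the list of index file names and the function name `canonical_url` are all assumptions made for illustration. This is not how any search engine actually normalizes URLs.

```python
from urllib.parse import urlsplit, urlunsplit

PREFERRED_HOST = "www.mysite.com"  # assumption: the www form is the preferred domain
BARE_HOST = "mysite.com"
INDEX_NAMES = {"index.html", "index.htm", "index.php"}  # assumed default documents


def canonical_url(url: str) -> str:
    """Map www/non-www and index-file variants of our own URLs to one form."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host not in (BARE_HOST, PREFERRED_HOST):
        return url  # not our site, leave it untouched
    path = parts.path or "/"
    head, _, last = path.rpartition("/")
    if last in INDEX_NAMES:
        path = head + "/"  # /index.html becomes /
    return urlunsplit((parts.scheme, PREFERRED_HOST, path, parts.query, ""))


variants = [
    "http://mysite.com",
    "http://www.mysite.com/",
    "http://mysite.com/index.html",
    "http://www.mysite.com/index.html",
]
# Four spellings a human reads as "the homepage" collapse to a single URL.
unique_forms = {canonical_url(u) for u in variants}
```

On a real site the same mapping is usually enforced with a server-side redirect and the preferred domain setting. The sketch only shows which URLs are meant to end up as one.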
What is the consequence if we don't standardize the homepage URL? I think if we care about its PageRank, we will probably lose some of the PageRank value passed by backlinks or by votes from the site's own internal pages. Let's have a look. Say I have one internal link or backlink voting for mysite.com and another voting for www.mysite.com, and on top of that some links voting for mysite.com/index.html (or .htm, or .php, or whatever the file extension is). For the spider they are not the same. The non-www and the www versions each receive only the PageRank passed by their own backlinks and internal links. So we are wasting part of the PageRank value, as much as half if the links are split evenly, just because we didn't standardize the canonical URL.
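A rough way to picture the split is to count link "votes" per URL exactly as written, then count them again after merging the variants. The sketch below uses a made-up list of links and a hypothetical helper `homepage_key`. It is plain link counting, not a PageRank calculation.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical links that all point at what a human sees as one homepage.
links = [
    "http://mysite.com/",
    "http://www.mysite.com/",
    "http://mysite.com/index.html",
    "http://www.mysite.com/index.html",
    "http://www.mysite.com/",
]


def homepage_key(url: str) -> str:
    """Collapse www/non-www and /index.html onto one key (illustrative rule only)."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    path = "/" if parts.path in ("", "/", "/index.html") else parts.path
    return host + path


as_written = Counter(links)                       # votes scattered over four URLs
merged = Counter(homepage_key(u) for u in links)  # all votes on a single key
```

Here `as_written` holds four separate entries, while `merged` holds one entry with all five votes.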
Another example: I have an index page at www.mydomain.com/index.html/ and it contains a few other links, e.g. to www.mydomain.com/service.html/ and www.mydomain.com/term-of-use.html/. This is fine until I add a page, accidentally or not, link to it as mydomain.com/new-page.html/, and that page links back to mydomain.com/index.html/. Then the spider will think there are two versions of my site, and the PageRank will be split between them. Clearly this is not good for us, but again, only if we care about PageRank. I think every webmaster cares about PageRank, don't they? I mean, who doesn't, right? A site with high PR is always good, because the higher it is, the more attention search engines will pay to it.
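A mix-up like this is easy to spot with a small script. The sketch below assumes www.mydomain.com is the preferred host and uses a hand-written map of pages to the links they contain, with an `http` scheme added and the trailing slashes dropped for simplicity. The names `site_links` and `off_canonical_links` are hypothetical, and a real check would first have to fetch and parse the pages.

```python
from urllib.parse import urlsplit

PREFERRED_HOST = "www.mydomain.com"  # assumption: www is the preferred domain
ALTERNATE_HOST = "mydomain.com"

# Hand-written stand-in for a crawl: page URL -> links found on that page.
site_links = {
    "http://www.mydomain.com/index.html": [
        "http://www.mydomain.com/service.html",
        "http://www.mydomain.com/term-of-use.html",
        "http://mydomain.com/new-page.html",
    ],
    "http://mydomain.com/new-page.html": [
        "http://mydomain.com/index.html",
    ],
}


def off_canonical_links(pages):
    """Return (source page, link) pairs whose link uses the non-preferred host."""
    found = []
    for source, links in pages.items():
        for link in links:
            if urlsplit(link).netloc.lower() == ALTERNATE_HOST:
                found.append((source, link))
    return found


problems = off_canonical_links(site_links)
for source, link in problems:
    print(f"{source} links to non-preferred URL {link}")
```

With the sample data, the script reports the link to new-page.html and the link back to index.html, the two links that create the second version of the site.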

It’s nearly impossible to say exactly how search engine spiders work, beyond the often-cited claim that there are over 200 factors that affect how search engines judge your website’s relevance.