Can spiders index your website? On-site search engine optimization.

Your confidence that the site will appear in a search engine's index rests on crawler-friendly code. After all, if the robot cannot index your pages, the search engine cannot include them in its search database.

Unfortunately, many websites use technologies or architectures that make them hostile to search engine crawlers. The search engine robot is really just an automated web browser that has to interpret your page's HTML code just like a regular browser.

But search robots are amazingly slow-witted. Many believe that even the most advanced search engine spiders are roughly equivalent to a version 2.0 web browser. This means that the spider cannot understand many web technologies and cannot read some parts of a page. This is especially harmful if those parts include some or all of the links on your page. If the spider can't read your links, it can't crawl through all the pages of the site.

As a search engine marketing consultant, I was often asked to evaluate new sites shortly after they were launched. Search engine optimization is often neglected during the development process, when designers are focused on navigation, usability and branding. As a result, many sites launch with built-in problems, and correcting these problems afterwards is much more difficult than avoiding them at the design stage.

Only when the site fails to appear in the search engine listings do many companies turn to SEO.

This is a shame, because search engines are perhaps the most important source of traffic for small businesses. Almost 85% of Internet users find websites through search engines. The value of a website that is not search engine friendly drops significantly.

In this article, I'll give an overview of some of the key things that can prevent a search engine crawler from indexing your site. This list is by no means exhaustive, but it highlights the most common obstacles that keep spiders from indexing a site.

Links written in JavaScript

JavaScript is a great technology, but it is invisible to all search engines. If you use JavaScript to control your site's navigation, spiders can have serious problems crawling it.

Links written in JavaScript are simply ignored by search bots.

For example, imagine you have a script that redirects the user to a specific page on your site. The script uses a goToPage() function to add a referral code to the end of the URL before sending visitors to the page.

I have seen websites where every link on the page was written in JavaScript in this way. In some cases JavaScript is used to include a referral code, in others it is used to redirect users to other pages on the site. But in all cases, the first page of the site was the only one that made it into the search engine's index.

No spider will follow links generated by JavaScript. Even if the spider could interpret this script, it would still have trouble simulating the various mouse clicks that trigger the goToPage() function with different referral codes.

Spiders will either ignore the contents of the SCRIPT tag, or read the contents of the script as if it were visible text.
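
To make this concrete, here is a minimal sketch of how a simple spider might collect links. It is not the code of any real search engine: the LinkSpider class, the extract_links() helper, the sample page, and the body of goToPage() are all assumptions made for illustration. The toy spider reads only the href values of A tags and has no way to execute a script.

```python
from html.parser import HTMLParser


class LinkSpider(HTMLParser):
    """Toy spider (hypothetical): collects only plain href values from A tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # A spider that cannot run scripts has nothing to follow in a
        # "javascript:" pseudo-URL, so such links are skipped.
        if href and not href.lower().startswith("javascript:"):
            self.links.append(href)


# Hypothetical page: one ordinary link and one JavaScript-driven link.
PAGE = """
<html><head>
<script>
function goToPage(page) { window.location = page + "?ref=123"; }
</script>
</head><body>
<a href="about.html">About us</a>
<a href="javascript:goToPage('products.html')">Products</a>
</body></html>
"""


def extract_links(html):
    """Feed a page to the toy spider and return the links it could follow."""
    spider = LinkSpider()
    spider.feed(html)
    return spider.links


print(extract_links(PAGE))
```

Only about.html is found. The products page, which is reachable solely through goToPage(), never enters the list of pages to crawl.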

As a general rule, it's best to avoid navigating with JavaScript.

DHTML Menus

DHTML dropdown menus are extremely popular when building a site's navigational structure. Unfortunately, they are also hostile to search engine spiders, as they again have trouble finding links in the JavaScript used to create them.

DHTML menus have an additional problem in that their code is often located in external JavaScript files. While there are good reasons to put the script in an external file, some spiders do not support this mechanism for building a link structure.

If you use DHTML menus on your site and want to see what effect they have on search engines, try turning off JavaScript in your browser - your dropdown menu will disappear, and chances are your top-level menu will disappear with it. Poof! Instantly, most of the pages of your site become inaccessible. The same thing happens with search engines.

Query Strings

If you have a dynamic site that uses technologies such as ASP, PHP, Cold Fusion, or JSP, there is a good chance that your URLs include a query string like this:

www.mysite.com/catalog.asp?item=320&category=23

This can be a problem, since many search engine spiders do not index links that include query strings. This is true even if the page the link points to contains nothing but standard HTML. The URL, by itself, is a barrier to the spider.

Why? Most search engines make a conscious decision not to index query string links because interpreting them requires extra bookkeeping. Spiders keep a list of all indexed pages and try to avoid re-indexing a page during a single visit to the site. They do this by comparing every new URL against a list of those they have already seen.

Now, let's say the spider sees a URL like this on your site:

www.mysite.com/catalog.asp?category=23&item=320

This URL leads to the same page as our first URL, even though the URLs are not identical (note that the name/value pairs in the query strings are in a different order).

To determine that this URL leads to the same page, the spider must split the query string and store each name/value pair. Then, whenever it sees a URL with the same parent page, it will need to compare its name/value pairs with all the query strings it has previously stored.
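
The comparison described above can be sketched in a few lines. This is only an illustration of the bookkeeping involved, not how any particular search engine works: the canonical_key() and is_new() helpers are hypothetical names, and the http:// prefix is added to the example URLs so that the standard URL parser can separate the host from the path.

```python
from urllib.parse import urlsplit, parse_qsl


def canonical_key(url):
    """Return a key that is identical for URLs that differ only in the
    order of their query-string name/value pairs."""
    parts = urlsplit(url)
    # Split the query string into (name, value) pairs and sort them so
    # that the original order no longer matters.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    return (parts.netloc.lower(), parts.path, tuple(pairs))


def is_new(url, seen):
    """Record the URL in `seen` and report whether it was new."""
    key = canonical_key(url)
    if key in seen:
        return False
    seen.add(key)
    return True


first = "http://www.mysite.com/catalog.asp?item=320&category=23"
second = "http://www.mysite.com/catalog.asp?category=23&item=320"

seen_pages = set()
print(is_new(first, seen_pages))   # first time this page is seen
print(is_new(second, seen_pages))  # same page, pairs in a different order
```

The first call reports a new page and the second does not, but only because every query string was parsed, sorted and stored. That is the extra work a spider takes on for each such URL.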

Keep in mind that our example query is quite small; a query string could be much larger. I've seen query strings that were 200 characters long and contained a dozen different name/value pairs.

So, indexing pages with query strings means a lot of extra work for the robot.

Some robots, such as Googlebot, will work with URLs that have a limited number of name/value pairs in the query string. Other spiders will ignore all URLs containing query strings.

Flash Technology

Flash is impressive, and far flashier than plain HTML. It is dynamic and cutting-edge. Unfortunately, spiders use trailing-edge technology. Remember: roughly speaking, a search engine spider is equivalent to version 2.0 of a web browser. Spiders are simply unable to interpret newer technologies such as Flash.

So, even though Flash animations may wow your visitors, they are invisible to search engines. If you're using Flash to spruce up your site a bit, but most of your pages are written in standard HTML, this won't be a problem. But if you've built your entire site using Flash, you'll have a hard time getting it indexed.

Frames

Did I mention that search engine spiders use weak technology? That's right, they're so low tech that they don't support frames either. If you use frames, the search engine will be able to crawl your front page, the one containing the FRAME tags. But it won't be able to find the individual framed pages that make up the rest of your site.

In this case, you can at least work around the problem by including a NOFRAMES section on the first page of your site. This section of your page will be invisible to anyone using a browser that supports frames. Search engines, however, can read it, so any content you place in the NOFRAMES section can be indexed.

If you include a NOFRAMES section, take care to include real content there. At a minimum, you should place standard hypertext links (A HREFs) pointing to your individual Frame pages.
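
One way to check a frameset page against this advice is sketched below. It is a rough check under stated assumptions, not a full HTML parser: the noframes_links() helper and both sample pages are hypothetical, and the regular expressions only handle simply written, quoted href attributes.

```python
import re

# Matches the body of the first NOFRAMES section, ignoring case and newlines.
NOFRAMES_RE = re.compile(r"<noframes[^>]*>(.*?)</noframes>", re.IGNORECASE | re.DOTALL)
# Matches the quoted href value of an A tag.
HREF_RE = re.compile(r"<a\s[^>]*?href\s*=\s*[\"']([^\"']+)[\"']", re.IGNORECASE)


def noframes_links(html):
    """Return the standard links found inside the NOFRAMES section, if any."""
    match = NOFRAMES_RE.search(html)
    if match is None:
        return []
    return HREF_RE.findall(match.group(1))


# Hypothetical front page that gives spiders real links to follow.
HELPFUL = """
<frameset cols="20%,80%">
  <frame src="menu.html">
  <frame src="content.html">
  <noframes>
    <p>Welcome to our catalog.</p>
    <a href="menu.html">Site menu</a>
    <a href="content.html">Main content</a>
  </noframes>
</frameset>
"""

# Hypothetical front page that gives spiders nothing to follow.
UNHELPFUL = """
<frameset cols="20%,80%">
  <frame src="menu.html">
  <frame src="content.html">
  <noframes>This site requires frames.</noframes>
</frameset>
"""

print(noframes_links(HELPFUL))
print(noframes_links(UNHELPFUL))
```

The first page exposes both framed pages through ordinary links. The second returns an empty list, which is all a spider that ignores FRAME tags would have to work with.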

Surprisingly often, people include a NOFRAMES section that says only "This site uses frames. Please upgrade your browser." If you'd like to experiment, do a Google search for "requires frames." You'll find about 160,000 pages, all of which include the text "this site requires frames." Each of these sites has limited search engine visibility.

With www or without www?

My website address is www.keyrelevance.com, but can people access it if they drop the "www." in the address bar? For most server configurations the answer is yes, but for some it is no. Make sure your site works both with and without www.
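
A quick way to test this is to request both forms of the address. The sketch below is only one possible approach: host_variants() and responds() are hypothetical helper names, the check uses plain HTTP, and treating any status below 400 as "working" is an assumption.

```python
import urllib.error
import urllib.request


def host_variants(host):
    """Return the www and non-www forms of a host name, in that order."""
    host = host.lower()
    bare = host[4:] if host.startswith("www.") else host
    return ["www." + bare, bare]


def responds(host, timeout=10):
    """Return True if the host answers a plain HTTP request for its front
    page. Needs network access, so it is not called automatically here."""
    try:
        with urllib.request.urlopen("http://" + host + "/", timeout=timeout) as resp:
            return resp.status < 400
    except (urllib.error.URLError, OSError):
        return False


print(host_variants("www.keyrelevance.com"))
```

With a network connection, calling responds() on each entry returned by host_variants() shows whether both forms of the address reach the site.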

This article looks at some of the more common reasons a site may fail to be indexed. Other factors, such as how you create your web page hierarchy, will also affect how many of your site's pages get indexed by a search engine.

Each of these problems has a solution, and in future articles I will touch on each to help you get more pages indexed.

If you are currently redesigning your site, I encourage you to take these notes into account before you launch. While each of these search barriers can be removed, it's better to start with search engine friendly development than to fix hundreds of pages after a project is launched.