Step 5 - Getting Indexed
Introduction
Please don't fall for any offers advertising website submission services, no matter how good
they sound!
A search engine submission company offering to submit your website to 500,000
search engines. Very impressive!
In truth, there aren't even 500,000 search engines, but if you really want to check for
yourself you can visit search engine colossus, They have a very extensive list of search
engines from around the world. The point we are making is Google, Yahoo!, and MSN are the
major players; spending any effort in trying to rank on the other engines is simply not worth
your time, unless your site has a local focus in a specific country. In this case, it may be
worth discovering other popular local search engines and submit your site to them.
In the latest survey of the most popular search engines conducted by Comscore, Google, Yahoo!, MSN and ASK control over 90% of the search referrals in the United States and close
to that figure around the world. Google's dominance is even more impressive in European
countries such as the UK and Germany where Google's search market share is nearly 75%.
Recent survey results of the most popular search engines by Nielsen NetRatings.
Armed with these statistics, you should ask yourself, why on earth you would want to submit
your websites to the other 499,997 search engines? In the last 5 years since Trendmx.com
had been online, we have yet to see a single visitor from one of those obscure search
engines these over hyped submission companies rave about.
How do websites get indexed?
If you have never submitted your website to any of the search engines and found yourself
stumbling upon your site on Google, Yahoo!, or MSN, don't be overly surprised by your
discovery, there is a good reason for it.
The top three search engines have automated the retrieval of web content by following
hyperlinks from site to site. They are called "crawlers", or sometimes referred to as "spiders",
"bots" or "agents". Their job is to index fresh updated content and links. Your website likely
got included in their database because the spiders have discovered a link pointing to your
new domain somewhere on the web. In short, someone had placed a link pointing to your
website on one of their web pages, blogs or may be your site was bookmarked on popular
bookmarking sites such as Delicious.com or Stumbleupon.com.
Most likely nobody asked you for your permission to link to your site, but alas, this is the
nature of the web. Every website has the ability to connect to another via hyperlinks.
Do I have to re-submit a new or changed web page to
get re-indexed?
The crawler based search engines automatically re-index websites daily, weekly or monthly
depending on the website's popularity. The depth of the indexing varies each time the search
engines visit your website. Sometimes they perform a deep crawl of every page and other
times they may just skim your website looking for changed or new web pages. The frequency
of the deep crawls can't be accurately predicted, but search engines perform deep crawls
about every 1-2 months and shallow crawls are performed as often as daily, depending on
how popular the website is and its Google PageRank value. Higher Google PageRank triggers
more frequent updates by Google. We'll tell you more about the Google PageRank and its
effects on your search engine ranking in the next lesson.
Submitting to international versions of Google, MSN and
Yahoo! myth... busted
Sometimes you may come across search engine submission companies making wild claims
about the advantages of their submission services. One of those claims is the advantage of
submitting to international versions of Google, Yahoo!, and MSN. The offer goes something
like this: "get listed in Google UK, Germany, France, Brazil, etc." Sorry to be so blunt here,
but this is a load of "BS."
When you submit your website to Google, Yahoo!, or MSN, your website is automatically
included in all the international versions of their search engines. The claim that you need to
submit your site to international versions of Google, Yahoo and MSN falls into the same
category as submitting your site to five hundred thousand search engines. In the end, it will
only make your wallet thinner and doesn't result in any advantage for your site.
The traditional search engine submission process
Although the traditional method of submitting your website to a search engine in order to get
listed is still viable, there are much faster methods to get your site indexed. The traditional
registration method is known as search engine submission and it's the process of directly
notifying the search engines about your new website by visiting them and submitting your
URL through their web forms.
In order to submit your website, you are required to enter your home page URL only, and
sometimes a brief description of your website. One very important thing to note is you
never have to submit individual pages of your site to crawler based search engines.
Manual or automatic submission?
There are two ways you can submit your site to the search engines using the manual method
outlined above or using some type of automated software tool or online service. Our own
SEO software tool, SEO Studio comes with a free automatic submission tool.
The most obvious benefit to using an automatic submission tool is that it will save time. If
you only want to submit one website to the Google, Yahoo! and MSN you may be able to
accomplish the task in less than an a few minutes. But if you want to submit multiple
websites to dozens of international search engines the time required would be greatly
increased. The SEO Studio submission tool can submit your website details automatically to
multiple search engines simultaneously with the click of a button. Optionally, the SEO Studio
submission tool can create a submission report for future record keeping.
How can we speed up the indexing process?
The average waiting period for a new website to get indexed by the major search engines
can vary from a few days to a month or more. If you get impatient and decide to resubmit
your site over and over again, the search engines may in fact make you wait longer, so don't
do it. The most effective solution to getting your website indexed faster is to gain a few links
from sites that have already been indexed by Google, Yahoo! or MSN.
You may be able to reduce the waiting period for indexation from a few weeks to as little as
48 hours by following these simple steps outlined below:
-Get your web site listed in a local trade organization's online directory or
chamber of commerce website.
-Submit your web site to international and local search engines and
directories.
-Exchange links with other related websites, look for high Google PageRank link
partners.
-Get listed in DMOZ, Yahoo! and other smaller, but high quality directories.
-Create a useful blog on topics related to your website, and ask other websites to
link to your blog.
-Submit your blog's RSS feed to the top search engines. Google, Yahoo! and
MSN use blog feeds to power their blog search engines and when you submit your
blog's RSS feed URL, your site also gets indexed right away.
-Participate in popular forums or community discussions and post your
website URL as part of your signature.
-Create a press release and submit it to an online press release distribution
service such as prweb.com
-Submit your site using Delicious.com or Stumbleupon.com. Both of these
social bookmarking and discovery sites are an excellent way to get links pointing to
your site.
-Submit your site using the Google Site Map will ensure a quick indexing
process by Google of all your important web pages.
Finding cached copies of web pages
A cached copy of the web page is the search engine's view of a page at a specific point in
time in the past. Each cached web pages is time stamped and stored for retrieval, but only
the text portion of the page is being saved by the search engines. All the images and other
embedded object are stripped out prior to archiving.
When the user clicks on a "Cached" web page link, the search engines combine the archived
HTML source code with the server side images and object that are still present on the web
server.
Clicking on the "Cached" link will retrieve the archived copy of the web page as it was
"seen" by Google the last time.
Yahoo! and MSN also provides this very useful function to its users, but Google is the only
one that actually provides a command that you can enter in the search window to look up a
cached copy of a web page like this: cache:your-domain.com.
What are the page size limits for search engine
crawlers?
An often neglected topic pertaining to search engine indexing is how much content search
engine spiders actually index. The search engine crawlers have preset limits on the volume of
text they are able to index per page. This is important, because going over the page size
limit could mean your pages may only be partially searchable.
Each search engine has its own specific page size limits. Familiarizing yourself with these
limits can minimize future indexing and ranking problems. If a web page is too large, the
search engines may not index the entire page. Although the search engines don't publish
exact limits of their spiders' technical configuration, in many instances we can draw definite
conclusions about page size limitations. Various postings by Gooogleguy, on
Webmasterwold.com and our own observations point to a 101 Kilobytes (KB) text limit size
by the Googlebot crawler. At the beginning of 2005 it appeared that Google indexed more
than 101 Kilobytes of individual HTML pages. We continue to see evidence of web pages
larger than 101KB in the search results pages of Google.
Here is an example of a large HTML page found on Google with a search for the keyword
"recipe" Try typing this command into the Google search box: filetype:html recipe. The
screenshot below shows a web page, indexed with over 200 Kilobytes of text.
The evidence of larger than 101KB HTML documents being indexed by Google in
July 2005
Other popular document types found on Google are Adobe's Portable Document File (PDF)
types. The PDF file size indexed by Google is well over the 101 Kilobytes allowed for HTML
document types; we estimate the file size limit for PDF files to be around 2 Megabytes.
The other major search engines, Yahoo! and MSN have similar or even higher HTML, and PDF
file size limits. For example the Yahoo! and MSN spider's page size limit seems to be around
500 Kilobytes. Beyond the well known PDF file format, all major search engines can index
flash, Microsoft Word, Excel, PowerPoint, and RTF rich text file formats.
In the previous section we have discussed how new websites get indexed and how the
submission process works. If we want our websites to rank well, we have to make sure all
the web pages contained within our website get crawled as frequently as possible.
Finding out how many web pages the search engines have indexed is important for two
reasons:
-Web pages that are not indexed by the search engines can never achieve
any ranking. In order to determine a web page's relevance in response to a query,
the search engines must have a record of that page in their database including the
complete page content.
-The number of web pages indexed can influence a website’s overall ranking
score. Larger websites tend to outrank 2-3 page mini websites, providing all other
external ranking factors are equal. A word of caution to those who are thinking of
using automated tools to produce thousands of useless pages to increase a
website's size. This type of "scrapped" content will only ensure your site will be
banned by the engines in no time.
There are very simple ways to check if your website has already been indexed. Go to a
search engine, and simply enter the domain name of your website or some unique text that
can be found on your web page only. This can be your business name or a product name
which is unique to your website.
Finding out if your home page is indexed is very important, but what about all the other
pages on your site? Google, Yahoo!, and MSN provide search commands for finding all the
pages indexed on a specific domain. These search options are accessible on most engines by
clicking on the advanced search option link next to the search box.
Here is a list of commands for the major search engine to find the individual web pages
indexed:
| Search Engine |
Display Indexed Pages Command |
| AltaVista |
domain:your-domain.com |
| AllTheWeb |
domain:your-domain.com |
| AOL Search |
site:your-domain.com your-domain.com |
DMOZ/Open
Directory |
your-domain.com |
| Google |
allinurl:your-domain.com+site:your-domain.com |
| HotBot (ASK Jeeves) |
domain:your-domain.com your-domain.com |
| MSN Search |
site:your-domain.com |
| Teoma |
site:your-domain.com your-domain.com |
| Yahoo! |
site:your-domain.com your-domain.com |
Total indexed pages online reporting tools
The above method for checking indexed pages is just one of many ways we can find out how
many pages got indexed by the search engine crawlers. There are some excellent free online
tools available to check the number of indexed pages on multiple search engines
simultaneously. We are especially fond of Marketleap.com's indexed page checking feature.
Their search engine saturation reporting tool is especially useful since it allows you to enter
up to 5 other domain names to check. This a great way to check side-by-side the number of
web pages indexed compared to your competitors.
Detailed indexed pages reporting with SEO Studio
SEO Studio provides an indexed page checking feature as part of the Links Plus+ tool. The
Links Plus+ tool can extract additional information about each indexed URL, including the
page title, META keywords, META descriptions and Google PageRank. If you enable the
keyword analysis option in the advanced options screen, Links Plus+ can also display the
keyword density of the title and META tags for each individual page.
One of the most valuable features of this tool is its ability to display the Google PageRank of
each indexed page. The Google PageRank is an important ranking factor, and a good
indication how well individual pages can compete for top rankings.
In the next lesson, we will cover a topic called “deep linking” and how individual pages on
your website can rank higher for specific keyword terms than your home page for example.
When you know the PageRank value of individual pages on your website, you can better
target your link popularity campaign to channel PageRank to individual pages and to increase
the number of pages indexed by the search engines.
Controlling search engine spiders with the Robots.txt
file
The major search engines including Google, Yahoo! and MSN use web crawlers to collect
information from your websites. These search engine spiders can be instructed to index
specific directories or document types from your website with the use of the robots.txt file.
This file is a simple text file and has to be placed in the root folder or top-level directory of
your website like this: www.mydomain.com/robots.txt.
The robots.txt file can contain specific instructions to block crawlers from indexing entire
folders or document types, such as Microsoft Word or PDF file formats.
The Robots.txt file has the following format:
User-agent: <User Agent Name or use "*" for all agents>
Disallow: <website folder or file name or use "/" for all files and folders>
In the table below we have provided some instructions on how you can set different crawler
blocking methods in the robots.txt file.
| To do this: |
Add this to the Robots.txt file: |
Allow all robots full access
and to prevent "file not
found: robots.txt" errors |
Create an empty robots.txt file |
Allow all robots complete
access |
User-agent: *
Disallow: |
Allow only the Googlebot
spider access |
User-agent: googlebot
Disallow:
User-agent: googlebot
Disallow: |
Exclude all robots from the
entire server |
User-agent: *
Disallow: / |
Exclude all robots from the
cgi-bin and images directory |
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/ |
Exclude the Msnbot spider
from indexing the
myonlinestore.html web page |
User-agent: Msnbot
Disallow: myonlinestore.html |
Controlling indexing of individual web pages with META
tags
One of the simplest ways to control what search engines index on your website is to insert
code inside your HTML pages META tags section. You can insert robots META tags that can
restrict access to a specific web page or prevent search engine spiders from following text or
image links on a web page. The main part of the robot META tag is the index and follow
instructions.
The robots META tag has the following format:
INDEX,FOLLOW: <Index the page and follow links>
NOINDEX,NOFOLLOW: <Don't Index the page and don't follow links>
The example below displays how we can use the robots META tag to prevent spiders from
indexing or following links on this web page.
<HTML>
<HEAD>
<META NAME="ROBOTS" CONTENT="NOINDEX,NOFOLLOW">
</HEAD>
<BODY>
In the table below we have provided some more examples of how the robots META tags can
be used to block search engine crawlers from indexing web pages and following links.
| To do this: |
Add this to HEAD section of your web pages: |
Allow the web page to
be indexed and allow
crawlers to follow links |
<META NAME="ROBOTS"
CONTENT="INDEX,FOLLOW"> |
Don't allow the web
page to be indexed and
allow crawlers to follow
links |
<META NAME="ROBOTS"
CONTENT="NOINDEX,FOLLOW"> |
Allow the web page to
be indexed, but don't
allow crawlers to follow
links |
<META NAME="ROBOTS"
CONTENT="INDEX,NOFOLLOW"> |
Don't allow caching of
the web page |
<META NAME="ROBOTS" CONTENT="NOARCHIVE"> |
Don't allow the web
page to be indexed and
don't allow crawlers to
follow links |
<META NAME="ROBOTS"
CONTENT="NOINDEX,NOFOLLOW"> |
Indexed pages reporting tools using web server logs
So far we have discussed the methods you can use to find out which pages on your site had
been indexed by the search engines. An alternative and very effective method to report
search engine spider activity on your website is to use your web server logs and some type
of visitor analysis software that can extract search engine crawler data.
Each time a search engine spider requests and downloads a web page, it creates a record in
the web server's log file. Analyzing these large web server logs can not be accomplished
effectively without the use of specific software tools. These tools can aggregate and
summarize large amounts web server log data and report search engine spider activity by
page. Some of the better known visitor statistics analysis tools can provide basic search
engine spider reporting, but a specialized software tool called Robot-Manager Professional
Edition can also help you create Robots.txt files.
When analyzing web server logs we want to know:
-Have there been any search engine spider visits? If there are no spider visits,
there is no chance of getting ranked. If you don't see any sign of crawler activity in your web server logs it's time to get your site indexed by acquiring some high
quality links. Please refer to our guide on Getting Indexed in step 5.
-When was the last spider visit? The more recent the last crawler visits are the
better. Frequent search engine bot visits ensure your new and updated content gets
into the search engine databases and can potentially rank in the search results.
-Did all the major search engine spiders send their crawlers to index our
website? Google, Yahoo! and MSN all have their own crawler agents and they
leave different trails in the web server logs to identify them. Please refer to the list
below to identify the Google, Yahoo! and MSN crawlers.
Googlebot: Mozilla/5.0 (compatible; Googlebot/2.1;
+http://www.google.com/bot.html), Yahoo! Slurp:Mozilla/5.0 (compatible; Yahoo!
Slurp; http://help.yahoo.com/help/us/ysearch/slurp) and the MSNbot:msnbot/1.0
(+http://search.msn.com/msnbot.htm)
-Did the search engine spiders index all of our web pages? We can type the
URL into the search engine's search box, we can use the site command, and we can
also check the spidered pages using a web server log analyzer.
-Did the search engine spiders yield to our Robots.txt file exclusion
instructions? Finding web pages or images that you don't want the public to see
on the search engines can be very upsetting. You should double check your
robots.txt file instructions and perform a search for restricted file URLs and images
on the search engines.
Indexed pages reporting tools using server-side scripts
If you don't want to download large amounts of web server log files to analyze, there is
another method to report search engine spider visits. A free set of web server scripts called
SpyderTrax was developed in PERL by Darrin Ward to track some of the major search
engines spiders, Google, Yahoo!, MSN, AltaVista, AllTheWeb, and Inktomi among others.
Complete installation instructions are available on the website. Once you have inserted a few
lines of code into your web pages, the program starts tracking spider visits automatically.
The screenshot below illustrates how SpyderTrax reports search engine spider visits.
SpyderTrax's search engine crawler activity reporting interface
Conclusion
The crawler based search engines make the webmasters job really easy. Initially, you have
to let the search engines know your site is online by giving them a link to follow or submit
your home page directly. Without any additional work on your part, your site will be visited
over and over again by the search engine robots to find any new pages or updated content.
The search engines give us the tools to find out which individual pages are in their
databases. We also have a great amount of control what parts of our website we want them
to index with the use of the robots file and META tags on individual web pages.