Create Site Map Tutorial
UPDATE: Be sure to read the update on page 6 of this tutorial.
Google has a service for webmasters called "Google Sitemaps". In a nutshell, you provide an XML file with information about your site for the Googlebot crawler. The service makes no guarantees that anything good or bad will happen, but you never know, so I decided to figure it out and take the plunge.
There are several site map generator sites on the web. I tried a few and ended up using the one at xml-sitemaps.com. It is a nice service for a smaller site, and it's free for under 500 pages. My problem was that I got tired of having to go over there, start the process, wait, copy the file to my local web folder, and then upload it to the server.
So, I started looking for ways to do this myself. There are two main techniques for generating the list of URLs:
- Start with the Main page of a site and extract all the <a> tags that link to other pages on the site, then continue until you compile a complete list of pages
- Use the server file system to locate all pages with .htm, .html, .shtml (or whatever else you might want to choose) extensions
Each has pros and cons. With the former, any page that is orphaned in your site - one that can't be reached by following links from your main page - will never show up in your sitemap.xml file. That can be good or bad, depending on your site design. With the latter, the problem is the opposite: every page will be included whether you want it to be or not - again, good or bad depending on your site.
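To make the orphaned-page issue concrete, here is a minimal Python sketch of the link-following technique. It is not the code used later in this tutorial. To keep it self-contained it crawls a made-up, in-memory site (the `SITE` dictionary, the page names, and the `crawl` helper are all hypothetical) instead of fetching real pages over HTTP.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(pages, start):
    """Breadth-first walk over `pages` (path -> HTML), following <a> links from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        path = queue.popleft()
        parser = LinkExtractor()
        parser.feed(pages[path])
        for link in parser.links:
            # Only follow links to pages that exist on this (hypothetical) site.
            if link in pages and link not in seen:
                seen.add(link)
                queue.append(link)
    return sorted(seen)


# Hypothetical three-page site: nothing links to orphan.html.
SITE = {
    "/index.html": '<a href="/about.html">About</a>',
    "/about.html": '<a href="/index.html">Home</a>',
    "/orphan.html": "<p>No page links here.</p>",
}

found = crawl(SITE, "/index.html")
print(found)
```

In this sketch the crawl reaches the index and about pages but never sees the orphaned page, which is exactly the trade-off described above.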
As a longtime programmer, I feel the file system approach gives me better control, since I can decide what shows up on the list and what doesn't - regardless of whether it is linked in my site.
In this tutorial, I will attempt to explain the routines I use to do this on my sites. I use a standard tree traversal, starting from the root folder of the server, to build a list of files that have HTML extensions. From this list, I create a new Google-compatible 'sitemap.xml' in the root folder.
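As a rough preview of the idea, the traversal and file generation might look something like the Python sketch below. This is only an illustration under assumptions, not the routines covered in the rest of the tutorial: the function names, the extension list, and the example base URL are mine, and the XML uses the current sitemaps.org 0.9 namespace with only the required `<loc>` element.

```python
import os
from xml.sax.saxutils import escape

# Extensions treated as HTML pages (an assumption; adjust for your site).
HTML_EXTENSIONS = (".htm", ".html", ".shtml")


def find_pages(root):
    """Walk the tree under `root` and return site-relative paths of HTML files."""
    pages = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if name.lower().endswith(HTML_EXTENSIONS):
                full = os.path.join(folder, name)
                # Convert the local path to a URL-style path with forward slashes.
                pages.append(os.path.relpath(full, root).replace(os.sep, "/"))
    return sorted(pages)


def build_sitemap(base_url, pages):
    """Return sitemap XML with one <url> entry per page."""
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for page in pages:
        loc = base_url.rstrip("/") + "/" + page
        lines.append("  <url><loc>%s</loc></url>" % escape(loc))
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"


def write_sitemap(root, base_url):
    """Write sitemap.xml into `root` and return the XML text."""
    xml = build_sitemap(base_url, find_pages(root))
    with open(os.path.join(root, "sitemap.xml"), "w", encoding="utf-8") as out:
        out.write(xml)
    return xml
```

Calling `write_sitemap("/path/to/webroot", "http://www.example.com/")` (both values are placeholders) would list every matching file under the web root. Excluding pages you don't want listed would take an extra filter inside `find_pages`.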