Site Map Tutorials
Google offers a service for webmasters called "Google Sitemaps". In a nutshell, you provide an XML file with info about your site for the Googlebot site crawler. Google makes no guarantees that submitting one will change anything, for better or worse, but you never know, so I decided to figure it out and take the plunge.
There are several site map generator sites on the web. I tried a few and finally ended up using the one at xml-sitemaps.com. It's a nice service for a smaller site, and it's free for sites under 500 pages. My problem was that I got tired of having to go over there, start the process, wait, copy the file to my local web folder, and then upload it to the server.
So, I started looking for ways to do this myself. There are two main techniques for generating the list of URLs:
- Start with the main page of a site and extract all the <a> tags that link to other pages on the site, then repeat for each page found until you have compiled a complete list of pages
- Use the server file system to locate all files with .htm, .html, .shtml (or whatever else you might want to choose) extensions
Each has pros and cons. With the former, any page that is orphaned in your site - that can't be reached by following links from your main page - will never show up in your sitemap.xml file. That can be good or bad, depending on your site design. With the latter, the problem is the opposite: every page will be included whether you want it to be or not - again, good or bad depending on your site.
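To make the first technique concrete, here is a minimal sketch of a link-following crawl. It is written in Python purely for illustration and is not the routine used on this site. The names `LinkCollector` and `crawl_site` are made up for this example, and `fetch` is assumed to be a function you supply that returns a page's HTML for a URL, or None if the page can't be read.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl_site(start_url, fetch):
    """Breadth-first walk over same-host links, starting at start_url.

    fetch is a caller-supplied (hypothetical) function that maps a URL to
    its HTML text, or returns None when the page cannot be read.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # broken link: leave it out of the list
        pages.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.hrefs:
            # Resolve relative links and drop any "#fragment" part.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return sorted(pages)
```

Because the walk only follows links, a page that nothing links to never enters the queue, which is exactly the orphan behavior described above.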
As a longtime programmer, my feeling is that the file system approach gives me better control: I decide what shows up on the list and what doesn't, regardless of whether it is linked in my site.
I use a standard tree traversal, starting from the root folder of the server, to build a list of files that have HTML extensions. From this list, I create a new Google-compatible 'sitemap.xml' in the root folder.
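The traversal and the file generation can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the actual script: `find_pages`, `build_sitemap`, and the `skip_dirs` parameter are hypothetical names, the base URL is supplied by the caller, and the namespace shown is the one from the sitemaps.org 0.9 protocol, which may differ from what an older generator declares.

```python
import os
from urllib.parse import quote
from xml.sax.saxutils import escape

HTML_EXTENSIONS = (".htm", ".html", ".shtml")


def find_pages(root, skip_dirs=()):
    """Return site-relative paths of all HTML files under root, sorted."""
    found = []
    for folder, subdirs, files in os.walk(root):
        # Pruning in place stops os.walk from descending into skipped folders.
        subdirs[:] = [d for d in subdirs if d not in skip_dirs]
        for name in files:
            if name.lower().endswith(HTML_EXTENSIONS):
                relative = os.path.relpath(os.path.join(folder, name), root)
                found.append(relative.replace(os.sep, "/"))
    return sorted(found)


def build_sitemap(base_url, paths):
    """Render the paths as a sitemap document with one <url> entry per page."""
    base = base_url.rstrip("/") + "/"
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ]
    for path in paths:
        # Percent-encode the path, then escape &, < and > for XML.
        lines.append("  <url><loc>%s</loc></url>" % escape(base + quote(path)))
    lines.append("</urlset>")
    return "\n".join(lines) + "\n"
```

Writing the result of `build_sitemap(...)` to a file named sitemap.xml in the root folder completes the step. The `skip_dirs` argument is one simple way to get the control mentioned above: any folder named there is left out of the list entirely.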
Another goal was to dynamically create my own site map page from the generated Google sitemap.xml file. My first attempt, Site Map #1, is not quite what I was after, but close. It uses a PHP script to read the Google sitemap.xml and display each <loc>value</loc> as an anchor. The links appear in the same order as the URLs in the sitemap.xml created by my other script.
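The idea behind that first site map can be sketched like this. The real page uses PHP; this is a Python stand-in with hypothetical function names (`sitemap_urls`, `links_html`), and it assumes the sitemap is well-formed XML.

```python
import xml.etree.ElementTree as ET
from html import escape


def sitemap_urls(xml_text):
    """Return every <loc> value in document order, whatever the namespace."""
    root = ET.fromstring(xml_text)
    urls = []
    for element in root.iter():
        # Namespaced tags look like "{namespace}loc"; compare the local part.
        if element.tag.rsplit("}", 1)[-1] == "loc" and element.text:
            urls.append(element.text.strip())
    return urls


def links_html(urls):
    """Render the URLs as an unordered list of anchors, in the given order."""
    items = ['<li><a href="%s">%s</a></li>' % (escape(u), escape(u)) for u in urls]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

Since nothing is sorted, the output order is simply the order in which the generator wrote the entries.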
My second Site Map uses another PHP script that reads the sitemap.xml file, opens each listed page, reads the first 3000 characters into a string, locates the beginning and ending of the <title> tag, and extracts the page title. Each URL and title pair is stacked in an array, which is sorted on the title and then output as a list of anchors. This is a lot more work, but it doesn't seem to take any longer, and I think the result is much more meaningful.
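A rough Python sketch of the title-based version follows. It is not the PHP script itself: it uses a regular expression instead of searching for the start and end positions of the tag, and `page_title`, `titled_links`, and the "(untitled)" fallback are assumptions made for this example. The 3000-character limit is the one mentioned above.

```python
import re
from html import escape, unescape

TITLE_PATTERN = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def page_title(html_text, limit=3000, fallback="(untitled)"):
    """Extract the <title> text from the first `limit` characters of a page."""
    match = TITLE_PATTERN.search(html_text[:limit])
    if not match:
        return fallback
    # Decode entities and collapse runs of whitespace and line breaks.
    title = " ".join(unescape(match.group(1)).split())
    return title or fallback


def titled_links(pages):
    """Turn (url, html_text) pairs into anchors sorted by page title."""
    entries = sorted(
        ((page_title(html_text), url) for url, html_text in pages),
        key=lambda entry: (entry[0].lower(), entry[1]),
    )
    return ['<a href="%s">%s</a>' % (escape(url), escape(title))
            for title, url in entries]
```

Sorting on the lowercased title keeps "apple" ahead of "Zebra", and a page whose title falls beyond the first 3000 characters is treated as untitled rather than read in full.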
Lastly, to make life easy on myself, I created a log-in form so I can just click Create Site Map and provide the correct key and password to generate a new Google sitemap.xml file.
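The check behind such a form might look something like the sketch below. This is only one possible approach, written in Python under my own assumptions: the key and password are stored as salted PBKDF2 hashes rather than plain text, and `hash_secret`, `is_authorized`, and the layout of the `stored` dictionary are hypothetical names, not a description of the actual form.

```python
import hashlib
import hmac


def hash_secret(secret, salt, rounds=100_000):
    """Derive a hash from a secret so the plain text never has to be stored."""
    return hashlib.pbkdf2_hmac("sha256", secret.encode("utf-8"), salt, rounds)


def is_authorized(key, password, stored):
    """Check a submitted key and password against stored salted hashes.

    stored is assumed to be a dict with 'salt', 'key_hash' and
    'password_hash' entries created earlier with hash_secret().
    """
    # compare_digest takes the same time whether or not the values match,
    # and both comparisons always run before the results are combined.
    key_ok = hmac.compare_digest(hash_secret(key, stored["salt"]), stored["key_hash"])
    password_ok = hmac.compare_digest(
        hash_secret(password, stored["salt"]), stored["password_hash"]
    )
    return key_ok and password_ok
```

Only when `is_authorized` returns True would the form go on to run the sitemap generation routine.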
There are tons of options out there. After I finished all of my routines, I stumbled on one that I think is neat, from a guy named Gary White at apptools.com. It's a dynamic site map created on the fly. That's a bit different from what I was after, since I wanted to use the sitemap.xml file, but it's a nice-looking site map complete with neat little icons, and if JavaScript is enabled, the site map is collapsible.
In these tutorials, I will attempt to explain the routines used to do this on my sites. I'll cover the basics of:
- generating the sitemap.xml file
- reading that file as XML to create a simple list of links
- expanding the simple list of links to use the actual <title> element of each page
- creating a simple, secure form to access the XML generation routine