Webmasters who check their server log files will no doubt be familiar with Henry, Mirago's robot (or spider as it is sometimes called). Henry's purpose is to collate information drawn from publicly accessible web pages.
In different countries, Henry goes by different names. In France, Henry is called Henri. In Germany, Henry is Heinrich. The United States knows Henry as Hank. Other countries use their own version of the name, or the closest available equivalent.
Henry's function is to behave like a web browser and read web pages. The contents of these pages are analysed, the theme determined and the text plus links extracted. Over a period of time, Henry reads millions and millions of pages. At varying frequencies, the information thus far gleaned is converted into searchable indexes. Once created, the indexes are passed to Q3, The Mirago Query System.
The latest generation of Henry handles pages with frames, deals with redirection and behaves very much like a modern web browser. The only major difference is that it gets through rather more pages each second than the average human could hope to digest. Depending upon the time of day, Henry may read and digest several hundred complete pages each second.
Most webmasters are keen for their web pages to be indexed. What they are less keen on is being overloaded by requests from search engine robots and other types of spider. To be friendly, therefore, Henry uses some complex logic to decide the order in which to read pages while avoiding frequent requests to any individual web site or domain name. In almost all cases, Henry reads no more than one page per minute from a site. Very often the interval between requests is much longer. Only for particularly large sites is the rate slightly increased.
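The one-page-per-minute rule can be illustrated with a small per-host scheduler. This is only a sketch of the general idea, not Henry's actual logic: the class name PolitenessScheduler, the method try_acquire and the fixed 60-second interval are assumptions made for the example.

```python
import time
from urllib.parse import urlsplit


class PolitenessScheduler:
    """Hypothetical helper: remembers when each host may next be requested."""

    def __init__(self, min_interval_seconds=60.0, clock=time.monotonic):
        self.min_interval = min_interval_seconds  # assumed: one page per minute
        self.clock = clock
        self.next_allowed = {}  # host -> earliest permitted request time

    def try_acquire(self, url):
        """Return True and reserve the host if it may be fetched right now."""
        host = urlsplit(url).hostname
        now = self.clock()
        if now < self.next_allowed.get(host, float("-inf")):
            return False  # too soon; the caller should pick a page from another site
        self.next_allowed[host] = now + self.min_interval
        return True
```

A crawler built this way would skip any URL for which try_acquire returns False and move on to a page from a different site, which keeps overall throughput high while each individual site sees only occasional requests.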
Henry fully observes the robots.txt protocol and also the robots 'noindex' and 'nofollow' meta tags. Support is included for 'Allow:' as well as wildcards in the file specification. For specific details, webmasters should review the guidelines on each Mirago site.
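To show what 'Allow:' and wildcard support can mean in practice, here is a minimal sketch of path matching against robots.txt rules. It assumes the common conventions that '*' matches any run of characters, a trailing '$' anchors the end of the path, the longest matching pattern wins and 'Allow' wins ties. Whether Henry resolves conflicts in exactly this way is not stated here, and the function names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """rules: list of ("allow" | "disallow", pattern) pairs for one user agent.

    Assumption: the longest matching pattern wins and 'allow' wins ties.
    """
    best = ("allow", -1)  # with no matching rule, the path is allowed
    for kind, pattern in rules:
        if not pattern:
            continue  # an empty Disallow line means "no restriction"
        if pattern_to_regex(pattern).match(path):
            length = len(pattern)
            if length > best[1] or (length == best[1] and kind == "allow"):
                best = (kind, length)
    return best[0] == "allow"
```

For example, with the rules Disallow: /private/ and Allow: /private/public/, a page under /private/public/ is readable while the rest of /private/ is not.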
Frames are a particular area of concern in web page design. Historically they have caused web developers a lot of grief. Search engines treated each frame as an individual page. As a result, a person who clicked on a search result would be sent to an individual frame out of its normal context. Depending upon the site design, the site might or might not have handled this. At best, web developers could include some script to force the frame to reload inside its frameset.
Henry resolves this by treating the entire frameset and all its embedded frames as a single page to be indexed. This means that a searcher is never sent to an individual frame but rather to the frameset as a whole. This is simpler from the web developer's perspective and much more desirable from that of someone searching. Henry similarly handles automatic redirections between pages.
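The frameset approach can be sketched as follows: collect the frame URLs from the frameset page, read each frame, and file the combined text under the frameset's own URL. This is an illustration under stated assumptions rather than Henry's implementation. FrameCollector, index_frameset and the fetch_text callback are hypothetical names, and nested framesets are not handled.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameCollector(HTMLParser):
    """Collects the absolute URLs of all <frame> elements in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.frame_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "frame":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative frame addresses against the frameset URL.
                self.frame_urls.append(urljoin(self.base_url, src))


def index_frameset(frameset_url, frameset_html, fetch_text):
    """Build one index entry whose URL is the frameset, not any single frame.

    fetch_text is a caller-supplied function mapping a URL to its page text.
    """
    collector = FrameCollector(frameset_url)
    collector.feed(frameset_html)
    combined = " ".join(fetch_text(url) for url in collector.frame_urls)
    return {"url": frameset_url, "text": combined, "frames": collector.frame_urls}
```

Because the entry's URL is always that of the frameset, a search hit on text from any frame leads the searcher to the complete page.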
Mirago's index of web pages is actually divided into multiple smaller collections of indexes. Different sites are assigned to different collections. Large sites may be split over multiple collections. Each collection has its own update frequency. Sites such as news sites are automatically included in the high frequency collections. Others may fall into the alternate day or weekly collections.
Henry manages all the collections and automatically creates new searchable indexes periodically depending upon the target collection frequency. As soon as indexes are created, they are automatically propagated to the computers which handle the actual index searching.
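The idea of collections with their own update frequencies can be sketched as a simple due-date check. The collection names and the six-hour figure for the high frequency collection are assumptions made for the example; only the alternate day and weekly frequencies are mentioned above.

```python
from datetime import datetime, timedelta

# Hypothetical collection names and rebuild intervals.
REBUILD_INTERVAL = {
    "high_frequency": timedelta(hours=6),  # assumed value, e.g. for news sites
    "alternate_day": timedelta(days=2),
    "weekly": timedelta(days=7),
}


def collections_due(last_built, now):
    """Return the sorted names of collections whose index is due for a rebuild.

    last_built maps a collection name to the time its index was last created;
    a collection that has never been built is always due.
    """
    return sorted(
        name
        for name, interval in REBUILD_INTERVAL.items()
        if now - last_built.get(name, datetime.min) >= interval
    )
```

A scheduler could call collections_due at regular intervals, rebuild the indexes for each name returned and then propagate them to the search machines.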
Every so often curious facts arise from studying the results of Henry's endeavours. One such curiosity came from inspecting the dictionary of words generated by Henry as it read many millions of web pages. It became apparent that there were huge numbers of words each occurring only a few times and containing only the letters a, c, g and t. Subsequent investigation revealed that this was a result of the human genome project. Many web pages publish lists of human genome coding sequences.
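A check of this kind is easy to reproduce on any word dictionary. The sketch below picks out rare words made up only of the letters a, c, g and t. The length and count thresholds are arbitrary assumptions, chosen so that ordinary short words such as "cat" or "tact" are not reported.

```python
import re

DNA_LIKE = re.compile(r"[acgt]+")


def dna_like_words(word_counts, min_length=8, max_count=3):
    """Return rare words consisting solely of the letters a, c, g and t.

    word_counts maps each word to the number of times it was seen.
    """
    return sorted(
        word
        for word, count in word_counts.items()
        if count <= max_count
        and len(word) >= min_length
        and DNA_LIKE.fullmatch(word)
    )
```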