Prescott Community Freenet Association Prescott, Arkansas: Saturday, May 17, 2008
Spiders, Web Crawlers, and Web Content Indexers
 In memoriam, Charles Ray Cross

Overview

This page is designed for operators of web crawlers, spiders, and other web content indexers and was written on 5/15/2001.

PCFA.ORG welcomes web content indexers to our site. However, the server is a Pentium 166 that's 5 years old. An indexer can create such a load that our server will grind to a halt. Since we are a non-profit organization, it's not easy for us to come up with the $3,000 it takes to get a new server to handle those loads. Oh, would you like to send us the $3,000? ;-).

Most of PCFA.ORG's content comes from a PostgreSQL database. We have around 20,000 to 30,000 web pages, based on the number of hits we have received from one indexer that was crawling us.

General Guidelines

PCFA.ORG requests that web content indexers follow these guidelines:
  1. Please honor 'robots.txt'. Mostly, this means do not retrieve .jpg or .gif files. There are also a few directories we don't want you to index, mostly because they are static .html snapshots of the dynamic pages created from the databases.
  2. Don't hit us every day! Please check each section of our server, described below, and rotate your crawler to choose one area each week (please treat the two newspapers as two areas). This will keep your indexes current within about a month, which should be quite adequate.
  3. We would love to know when your indexer is scheduled to visit our site. That way, we can coordinate in order to avoid server delays because of the number of concurrent hits. Please e-mail the information to admin@pcfa.org.
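As a rough illustration of guideline 1, a crawler could combine a robots.txt check with a simple file-extension filter. This is only a sketch: the robots.txt content below is hypothetical (the real file on our server may list different paths), and the names SAMPLE_ROBOTS_TXT, build_parser, and may_fetch are made up for the example. The extension filter is separate because Python's standard robots.txt parser matches path prefixes, not file types.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only. The real file at
# http://www.pcfa.org/robots.txt may list different directories.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /snapshots/
"""

# Image types that guideline 1 asks indexers not to retrieve.
SKIPPED_EXTENSIONS = (".jpg", ".gif")


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True only if the URL is not an image and robots.txt allows it."""
    path = urlsplit(url).path.lower()
    if path.endswith(SKIPPED_EXTENSIONS):
        return False
    return parser.can_fetch(user_agent, url)
```

In a real crawler the parser would be built once per visit from the downloaded robots.txt and consulted before every request.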
    Base Web Site
    Indexer's URL: http://www.pcfa.org/
    Content: Home page and PCFA.ORG services, the local community (clubs, churches, schools, government, businesses, etc), genealogy, etc.
    Configuration: Be sure to exclude the URLs listed below!
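
One way to apply this exclusion is a prefix test on each URL before it is queued. The sketch below assumes that "the URLs listed below" means the other areas' paths on www.pcfa.org; including /newspapers/ in that list is our reading, and the names EXCLUDED_PREFIXES and in_base_area are hypothetical.

```python
from urllib.parse import urlsplit

# Path prefixes of the other areas, taken from the section URLs on this page.
# Treating /newspapers/ as excluded is an assumption.
EXCLUDED_PREFIXES = ("/depot_museum/", "/cemetery/", "/newspapers/")


def in_base_area(url: str) -> bool:
    """Return True if the URL belongs to the base web site area."""
    parts = urlsplit(url)
    if parts.hostname != "www.pcfa.org":
        return False
    return not parts.path.startswith(EXCLUDED_PREFIXES)
```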

    Depot Museum
    Indexer's URL: http://www.pcfa.org/depot_museum/
    Content:
        Articles: 2 or 3 are updated or added each month.
        Photo books: /photobooks - almost 700 photos, but will grow to around 3,000 by the end of 2002. Mostly captions with links to a .jpg photo (do not retrieve the photo).
        Cemetery Survey: /cemetery - From the 1950s. Updated in May 2001 for the first time in 5 years. The "List of Cemeteries" will let you retrieve fewer than 100 pages containing the 13,000 records. I may add an A-Z surname index for 26 additional pages.
    Configuration: The base URL will index the entire set of Depot Museum web pages.

    Up-to-Date Cemetery Survey
    Indexer's URL: http://www.pcfa.org/cemetery/
    Content: A few records are updated and a couple dozen added (new deaths) each month.
    Configuration: Your indexer must sequentially choose from a pick list of about 100 cemeteries, then follow only the "next" when it appears to obtain the 22,000+ records. There will be about 300 pages in all. I may add an A-Z surname index for 26 additional links, another 300 pages. If the pick list gives too many problems for indexers, I'll try to come up with a URL parameter to make it easier.

    Newspapers
    Web browser's URL: http://www.picayune-times.com/ -or- http://www.pcfa.org/newspapers/
    Indexer's URL: See "Weekly Configuration" and "Periodic Configuration" for details.
    Content: Articles from each week's editions of the Nevada County Picayune and the Gurdon Times back to September 1, 1995. Records number in the tens of thousands, and each produces an individual web page.
    Weekly Configuration: On Wednesday morning (e.g., around 9:00-11:00 a.m. Central Time), you may access these two links:
        http://www.picayune-times.com/editionlist.heitml?pubname=picayune&pubdate=05-09-2001&txt=f
        http://www.picayune-times.com/editionlist.heitml?pubname=times&pubdate=05-09-2001&txt=f
    Notice the parameters pubname=picayune and pubname=times.
    Notice the parameter pubdate=05-09-2001 (month-day-year) and have your crawler substitute the date of the current Wednesday.
    Follow the links that include the k.number= parameter to get the specific story.
        http://www.picayune-times.com/showstory.heitml?show=t&k.number=15447&pubname=picayune&headline=...
    There will be 10-50 stories for each newspaper each week.
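
The two weekly links above can be generated from the date. The following is a minimal sketch, assuming the crawler normally runs on Wednesday; if it runs on another day, it falls back to the most recent Wednesday, which is our assumption and not something the server enforces. The function names are hypothetical.

```python
from datetime import date, timedelta

BASE = "http://www.picayune-times.com/editionlist.heitml"
PUBNAMES = ("picayune", "times")  # the two newspapers


def most_recent_wednesday(today: date) -> date:
    """Return today if it is a Wednesday, else the Wednesday before it."""
    # date.weekday() numbers Monday as 0, so Wednesday is 2.
    return today - timedelta(days=(today.weekday() - 2) % 7)


def weekly_urls(today: date) -> list[str]:
    """Build the edition-list URL for each newspaper for this week."""
    pubdate = most_recent_wednesday(today).strftime("%m-%d-%Y")
    return [f"{BASE}?pubname={name}&pubdate={pubdate}&txt=f" for name in PUBNAMES]
```

For example, weekly_urls(date(2001, 5, 9)) reproduces the two links shown above.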
    Do not follow the links at the bottom of the page:
        editionlist.heitml?pubname=picayune&pubdate=05-09-2001&txt=f Show List of Stories for ...
        editionlist.heitml?pubname=picayune&pubdate=05-09-2001&txt=t All Stories for ...
        editionlist.heitml?pubname=picayune List of Past Editions
        search.heitml?pubname=picayune Search ... database
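
A link filter for the weekly crawl might look like the sketch below: it keeps only story links (showstory.heitml with a k.number parameter) and so skips the edition-list and search links at the bottom of the page. The function name is_story_link is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit


def is_story_link(href: str) -> bool:
    """Return True for links to an individual story, False for everything else."""
    parts = urlsplit(href)
    # Compare only the last path segment so relative and absolute links both work.
    is_story_page = parts.path.rsplit("/", 1)[-1] == "showstory.heitml"
    return is_story_page and "k.number" in parse_qs(parts.query)
```

Because it is an allow-list, any new navigation links added to the page later would also be skipped by default.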
    Periodic Configuration: Not more than once a month, you may follow these links:
        http://www.picayune-times.com/editionlist.heitml?pubname=picayune
        http://www.picayune-times.com/editionlist.heitml?pubname=times
    See "Weekly Configuration", above, for "how to" and what to avoid.
    These links will provide a list of archived editions, sorted by most current first.
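
Guideline 2 asks indexers to pick one area each week, counting the two newspapers as two areas. One possible way to schedule that is to derive the area from the calendar week, as in this sketch. The AREAS list uses the indexer URLs given above; the function name and the choice of Monday as the start of the week are our assumptions.

```python
from datetime import date, timedelta

# One entry per area, using the indexer URLs from this page.
AREAS = [
    "http://www.pcfa.org/",
    "http://www.pcfa.org/depot_museum/",
    "http://www.pcfa.org/cemetery/",
    "http://www.picayune-times.com/editionlist.heitml?pubname=picayune",
    "http://www.picayune-times.com/editionlist.heitml?pubname=times",
]


def area_for_week(day: date) -> str:
    """Return the single area to crawl during the week containing `day`."""
    # Ordinal 1 (January 1 of year 1) is a Monday, so this counts whole
    # Monday-to-Sunday weeks and does not reset at year boundaries.
    week_index = (day.toordinal() - 1) // 7
    return AREAS[week_index % len(AREAS)]
```

With five areas, each one is revisited every five weeks, which is roughly the monthly refresh described in the guidelines.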

    Time

    The best time to index the PCFA.ORG server is between 5:00 a.m. and 12:00 noon Central Time. We run our daily routines, backups, etc. from midnight until around 4:00 a.m., so please try to avoid those hours. Do not hit us from 3:00 p.m. until 2:00 a.m., which is prime time for our ISP and our server.
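
These windows can be checked before a crawl starts. The sketch below takes a wall-clock time that the caller has already converted to Central Time. The labels, the function name crawl_window, and the treatment of window edges (start included, end excluded) are our assumptions. Hours this page does not mention (4:00-5:00 a.m. and noon-3:00 p.m.) are reported as "unspecified".

```python
from datetime import time


def crawl_window(central_time: time) -> str:
    """Classify a Central Time clock reading against the windows on this page."""
    if time(5, 0) <= central_time < time(12, 0):
        return "best"
    # Prime time wraps past midnight: 3:00 p.m. through 2:00 a.m.
    if central_time >= time(15, 0) or central_time < time(2, 0):
        return "forbidden"
    # What remains of the midnight-to-4:00 maintenance window is 2:00-4:00 a.m.
    if central_time < time(4, 0):
        return "avoid"
    return "unspecified"
```

A scheduler could, for instance, start a crawl only when this returns "best" and stop it as soon as the result changes.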

    Penalty

    Web content indexers that do not follow these guidelines will be blocked from our server.



    This page is copyrighted ©1995-2008 by the Prescott Community Freenet Association and Danny Stewart and was served on Saturday, May 17, 2008 at 12:04 (central time). It has been accessed 4,983 times since 05/15/2001. URL: http://www.pcfa.org/pcfa/spiders.heitml

    PCFA.ORG uses Redhat Linux. Linux since April 25, 1995! PCFA.ORG web server uses Apache. Heitml extends HTML on many pages. Some databases use PostgreSQL. Bank of Prescott furnishes space. IOCC.COM provides internet service.