Spiders, Web Crawlers, and Web Content Indexers
Overview
This page is designed for operators of web crawlers, spiders, and other web content
indexers and was written on 5/15/2001.
PCFA.ORG welcomes web content indexers to our site. However, the server is a Pentium 166
that's 5 years old. An indexer can create such a load that our server will grind to a halt.
Since we are a non-profit organization, it's not easy for us to come up with the $3,000 it
takes to get a new server to handle those loads. Oh, would you like to send us the $3,000?
;-).
Most of PCFA.ORG's content comes from a PostgreSQL database. We have around 20,000 to 30,000
web pages, based on the number of hits we have received from one indexer that was crawling
us.
General Guidelines
PCFA.ORG requests that web content indexers follow these guidelines:
- Please honor 'robots.txt'. Mostly, this means do not retrieve .jpg
or .gif files. There are also a few directories we don't want you to index,
mostly because they hold static .html snapshots of the dynamic pages created
from the databases.
- Don't hit us every day! Please check each section of our server, described below, and
rotate your crawler to choose one area each week (please treat the two newspapers as two
areas). This will keep your indexes current within about a month, which should be quite
adequate.
- We would love to know when your indexer is scheduled to visit our site. That way, we can
coordinate in order to avoid server delays because of the number of concurrent hits.
Please e-mail the information to
admin@pcfa.org.
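The first two guidelines can be sketched in a few lines of Python. This is only an illustration, not a required implementation: the user-agent name, the sample robots.txt rules (including the /snapshots/ directory), and the helper names are all hypothetical, and the week-number rotation is just one way to pick a single area each week. The real rules are whatever http://www.pcfa.org/robots.txt says.

```python
from datetime import date
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Hypothetical user-agent name; substitute your crawler's real one.
USER_AGENT = "ExampleIndexer"

# Illustrative rules only (assumption). The real file lives at
# http://www.pcfa.org/robots.txt and is not reproduced on this page.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /snapshots/
"""

# One area per week; the two newspapers count as two areas.
AREAS = [
    "Base Web Site",
    "Depot Museum",
    "Up-to-Date Cemetery Survey",
    "Nevada County Picayune",
    "Gurdon Times",
]

# Image types the guidelines ask indexers not to retrieve.
SKIPPED_EXTENSIONS = (".jpg", ".gif")


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_fetch(rules: RobotFileParser, url: str) -> bool:
    """Skip .jpg/.gif files, then defer to the robots.txt rules."""
    path = urlparse(url).path.lower()
    if path.endswith(SKIPPED_EXTENSIONS):
        return False
    return rules.can_fetch(USER_AGENT, url)


def area_for_week(day: date) -> str:
    """Pick one area per ISO week, cycling through all five in turn."""
    week = day.isocalendar()[1]
    return AREAS[week % len(AREAS)]
```

With five areas, the rotation returns to the same area every five weeks, which matches the "current within about a month" target above.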
- Base Web Site
- Indexer's URL:
http://www.pcfa.org/
- Content: Home page and PCFA.ORG services, the local community (clubs, churches,
schools, government, businesses, etc), genealogy, etc.
- Configuration: Be sure to exclude the URLs listed below!
- Depot Museum
- Indexer's URL:
http://www.pcfa.org/depot_museum/
- Content:
Articles: 2 or 3 are updated or added each month.
Photo books: /photobooks - almost 700 photos,
but will grow to around 3,000 by the end of 2002. Mostly captions with links
to a .jpg photo (do not retrieve the photo).
Cemetery Survey: /cemetery - From the 1950s.
Updated in May 2001 for the first time in 5 years. The "List of Cemeteries"
will let you retrieve fewer than 100 pages containing the 13,000 records.
I may add an A-Z surname index for 26 additional pages.
- Configuration: The base URL will index the entire set of
Depot Museum web pages.
- Up-to-Date Cemetery Survey
- Indexer's URL:
http://www.pcfa.org/cemetery/
- Content: A few records are updated and a couple dozen added (new deaths)
each month.
- Configuration: Your indexer must sequentially choose from a pick list of
about 100 cemeteries, then follow only the "next" when it appears
to obtain the 22,000+ records. There will be about 300 pages in all.
I may add an A-Z surname index for 26 additional links, another 300 pages.
If the pick list gives too many problems for indexers, I'll try to come up
with a URL parameter to make it easier.
- Newspapers
- Web browser's URL:
http://www.picayune-times.com/
- or - http://www.pcfa.org/newspapers/
- Indexer's URL: See "Weekly Configuration" and "Periodic Configuration" for details.
- Content: Articles from each week's editions of the
Nevada County Picayune and the Gurdon Times back to September 1, 1995.
Records number in the tens of thousands, and each produces an individual web page.
- Weekly Configuration: On Wednesday morning (e.g., around 9:00-11:00 a.m.
Central Time), you may access these two links:
http://www.picayune-times.com/editionlist.heitml?pubname=picayune&pubdate=05-09-2001&txt=f
http://www.picayune-times.com/editionlist.heitml?pubname=times&pubdate=05-09-2001&txt=f
Notice the parameters pubname=picayune and pubname=times.
Notice the parameter shown as pubdate=05-09-2001 (MM-DD-YYYY); have your crawler
specify the current Wednesday's date instead.
Follow the links which include the k.number= parameter to get the
specific story.
http://www.picayune-times.com/showstory.heitml?show=t&k.number=15447&pubname=picayune&headline=...
There will be 10-50 stories for each newspaper each week.
Do not follow the links at the bottom of the page:
editionlist.heitml?pubname=picayune&pubdate=05-09-2001&txt=f Show List of Stories for ...
editionlist.heitml?pubname=picayune&pubdate=05-09-2001&txt=t All Stories for ...
editionlist.heitml?pubname=picayune List of Past Editions
search.heitml?pubname=picayune Search ... database
- Periodic Configuration: Not more than once a month, you may follow these links:
http://www.picayune-times.com/editionlist.heitml?pubname=picayune
http://www.picayune-times.com/editionlist.heitml?pubname=times
See "Weekly Configuration", above, for "how to" and what to avoid.
These links will provide a list of archived editions, sorted by most current
first.
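As a rough sketch of the Weekly Configuration, the following Python builds the two edition-list URLs for the current Wednesday and filters links so that only story pages are followed. The function names are hypothetical, and "current Wednesday" is interpreted here as the most recent Wednesday on or before the day the crawler runs, which is an assumption. The URL shapes and parameter names are taken from the examples above.

```python
from datetime import date, timedelta
from urllib.parse import parse_qs, urlencode, urlparse

BASE = "http://www.picayune-times.com/"
PUBNAMES = ("picayune", "times")


def current_wednesday(today: date) -> date:
    """Most recent Wednesday on or before `today` (Monday is weekday 0)."""
    return today - timedelta(days=(today.weekday() - 2) % 7)


def weekly_edition_urls(today: date) -> list[str]:
    """Build one edition-list URL per newspaper, dated MM-DD-YYYY."""
    pubdate = current_wednesday(today).strftime("%m-%d-%Y")
    return [
        BASE + "editionlist.heitml?"
        + urlencode({"pubname": name, "pubdate": pubdate, "txt": "f"})
        for name in PUBNAMES
    ]


def is_story_link(href: str) -> bool:
    """Follow only showstory links that carry a k.number parameter."""
    parsed = urlparse(href)
    return (
        parsed.path.endswith("showstory.heitml")
        and "k.number" in parse_qs(parsed.query)
    )
```

Filtering on k.number rather than listing the links to avoid means the edition-list, "All Stories", past-editions, and search links at the bottom of the page are skipped without naming each one. The same filter applies when walking the archive lists in the Periodic Configuration.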
Time
The best time to index the PCFA.ORG server is between 5:00 a.m. and 12:00 noon Central Time.
We run our daily routines, backups, etc. from midnight until around 4:00 a.m., so please try to
avoid those hours. Do not hit us from 3:00 p.m. until 2:00 a.m., which is prime time for our ISP and our server.
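A crawler scheduler could encode these hours roughly as follows. This is a minimal sketch with hypothetical names; it assumes the caller has already converted the clock reading to Central Time, treats noon as the end of the best window, and labels the hours this page does not mention (4:00-5:00 a.m. and noon-3:00 p.m.) as "unspecified".

```python
from datetime import time


def crawl_window(central: time) -> str:
    """Classify a Central Time clock reading against the hours above."""
    # 3:00 p.m. until 2:00 a.m. wraps past midnight, hence the `or`.
    if central >= time(15, 0) or central < time(2, 0):
        return "forbidden"
    # Remaining part of the midnight-to-4:00 a.m. backup window.
    if central < time(4, 0):
        return "avoid"
    if time(5, 0) <= central < time(12, 0):
        return "best"
    return "unspecified"
```

Checking the forbidden range first means the overlap between midnight and 2:00 a.m., which is both prime time and backup time, is reported under the stricter label.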
Penalty
Web content indexers that do not follow these guidelines will be blocked from our
server.