Paperboy: News page HTML source notes
~~~~~~~~ v1.00 (28-Oct-1999)
WIRED NEWS
==========
Root Index page (http://www.wired.com/news/)
----------
Index page contains links to sections[1]: "Business", "Culture",
"Technology", and "Politics". The URL for each of these sections is
'http://www.wired.com/news/'.lc(section).'/xxxxxx.html'. The final component
(xxxxxx) appears variable and so will have to be sourced from the index page.
The front page only reproduces stories from these subsections.
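
A rough sketch of sourcing a section URL from the index page, in Java. It
assumes the section links appear as double-quoted href attributes, either
absolute or relative to the server root; the class and method names are
made up for illustration and are not part of any existing code.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SectionLinks {
    /**
     * Returns the first link under /news/<section>/ that ends in .html,
     * as an absolute URL, or null if the index page has no such link.
     */
    static String findSectionUrl(String indexHtml, String section) {
        // lc(section): the directory name is the lower-cased section name
        String dir = "/news/" + section.toLowerCase(Locale.ROOT) + "/";
        // Assumption: href values are double-quoted
        Pattern p = Pattern.compile(
            "href\\s*=\\s*\"([^\"]*" + Pattern.quote(dir) + "[^\"/]+\\.html)\"",
            Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(indexHtml);
        if (!m.find()) {
            return null;
        }
        String href = m.group(1);
        // Root-relative links need the server name in front
        return href.startsWith("http://") ? href : "http://www.wired.com" + href;
    }
}
```

Calling findSectionUrl(indexHtml, "Business") would then give the current
Business section page, whatever the variable file name happens to be.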
Section Index page (eg. http://www.wired.com/news/business/....)
-------------
Contains headlines/summaries between "" with
headlines in between the anchor tags. Next ... pair can be
ignored (contains time). Contents of ... after that contain the
summary. Skip forward to next ... for next story.
Headlines and/or summaries can contain HTML tags.
"Fancy" Story page
-------------
Stories are split up in this version, so the plain versions should be used.
However, the URLs of the plain versions do *not* look as if they can be
derived, so this page must be searched for the appropriate "printing"
hyperlink (the href containing the string "/print/").
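
The search for the "printing" hyperlink could look something like the
following. This is only a sketch: it assumes href values are double-quoted,
and PrintLinkFinder/findPrintLink are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PrintLinkFinder {
    // Assumption: href values are double-quoted; attribute name in any case
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    /** Returns the first href containing "/print/", or null if none is found. */
    static String findPrintLink(String storyHtml) {
        Matcher m = HREF.matcher(storyHtml);
        while (m.find()) {
            String href = m.group(1);
            if (href.contains("/print/")) {
                return href;
            }
        }
        return null;
    }
}
```

A null result means the "fancy" page had no recognisable print link, in
which case the caller has to fall back to the split version or skip the story.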
Plain Story page
-----------
Story text between "" and "". Story proper
begins at first "" tag (before that is meta content). Remove text
between "" and "".
Image "http://static.wired.com/news/images/pix155.gif" is referenced towards
the end of the story and should be removed (or changed to a locally cached
version).
Related links only require prefixing with server name. There are dodgy
tags which should be removed, otherwise Java (by default) generates
an exception whilst trying to display the page.
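
Prefixing the related links with the server name might be done as below.
The sketch assumes the links are root-relative, double-quoted hrefs; the
class and method names are illustrative only. It does not deal with the
tags that need removing.

```java
import java.util.regex.Matcher;

class RelatedLinks {
    /**
     * Prefixes every root-relative href (one starting with a single "/")
     * with the given server, e.g. "http://www.wired.com".
     * Links that are already absolute are left alone.
     */
    static String absolutize(String html, String server) {
        // (?i) matches HREF as well as href; (?!/) skips values starting "//"
        return html.replaceAll("(?i)(href\\s*=\\s*\")/(?!/)",
                               "$1" + Matcher.quoteReplacement(server) + "/");
    }
}
```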
THE REGISTER
============
Root Index page (http://www.theregister.co.uk/morenews.html)
----------
Contains links to stories in between "" and " | ".
Headlines are only hyperlinks in this section - no summaries available.
Story page (http://www.theregister.co.uk/yymmdd-xxxxxx.html)
-----
Story is between same tags as above and then within "" and ""
within that block.
BBC ONLINE
==========
Root Index page (http://news.bbc.co.uk/text_only.htm)
----------
No need to decode this as each of the sections can be accessed directly.
However, the site says it's currently undergoing a "redesign". The format below
may change...
Section Index page (http://news.bbc.co.uk/low/english/xxx/default.htm,
-------------       where xxx = { world, uk, uk_politics, business,
                                  sci/tech, health, education, sport,
                                  talking_point } )
Headlines/summaries between 2nd "" and "". Headline/URL is .... Summary is
either then between next and . HTML tags (comments, extraneous etc.) should
be stripped from both headline and summary.
Story page (http://news.bbc.co.uk/low/english/newsid.....stm)
-----
Story content is between second and after that. Linked images will need
rewriting and storing under some form of hash code in an images directory
(perhaps). Prefetch some images (eg. /furniture/nothing.gif) and if an image
is already present don't refetch. Mmmm, cachy.
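
One possible shape for the image cache, as a sketch only. The ImageCache
class, its method names and the choice of String.hashCode() for the file
name are all assumptions; hash codes can collide, so a real version might
want something stronger.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;

class ImageCache {
    private final Path dir;   // the local images directory

    ImageCache(Path dir) {
        this.dir = dir;
    }

    /** Local file name for an image URL: hex hash code plus original extension. */
    static String localName(String imageUrl) {
        int dot = imageUrl.lastIndexOf('.');
        int slash = imageUrl.lastIndexOf('/');
        // Only treat the dot as an extension if it is in the last path segment
        String ext = (dot > slash) ? imageUrl.substring(dot) : "";
        return Integer.toHexString(imageUrl.hashCode()) + ext;
    }

    /** Fetches the image unless a cached copy exists; returns the local path. */
    Path fetch(String imageUrl) throws IOException {
        Path target = dir.resolve(localName(imageUrl));
        if (Files.exists(target)) {
            return target;              // already present, don't refetch
        }
        Files.createDirectories(dir);
        try (InputStream in = URI.create(imageUrl).toURL().openStream()) {
            Files.copy(in, target);
        }
        return target;
    }
}
```

Image references in the story can then be rewritten to point at
localName(url) inside the images directory, and prefetching is just a call
to fetch() for each commonly used image.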
PROCESSING API
==============
A class which has the concept of a "current location" allowing the HTML file
to be read through in blocks. Some of its methods would be:
* Get URL which matches a String
* Find/crop to text between tag 1 & tag 2[*], either inclusively or exclusively
* Strip all HTML tags
* Remove text between two tags from middle of block
* Remove a String from the middle of block, either globally or just next
* Reset pointer to start of block
* Move pointer forward to next occurrence of a String
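
A minimal sketch of how such a class might look in Java. The class name,
method names and signatures are all hypothetical; the tag stripping is
naive (anything between '<' and '>'), and hrefs are assumed to be
double-quoted.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HtmlBlock {
    private static final Pattern HREF =
        Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    private String block;   // text currently being processed
    private int pos = 0;    // the "current location" within block

    HtmlBlock(String html) { this.block = html; }

    String text() { return block; }

    /** Reset pointer to start of block. */
    void reset() { pos = 0; }

    /** Move pointer to just past the next occurrence of s; false if absent. */
    boolean moveTo(String s) {
        int i = block.indexOf(s, pos);
        if (i < 0) return false;
        pos = i + s.length();
        return true;
    }

    /** First href at or after the pointer whose value contains s, or null. */
    String findUrl(String s) {
        Matcher m = HREF.matcher(block);
        m.region(pos, block.length());
        while (m.find()) {
            if (m.group(1).contains(s)) return m.group(1);
        }
        return null;
    }

    /** Crop block to the text between tag1 and tag2, searching from the pointer. */
    boolean crop(String tag1, String tag2, boolean inclusive) {
        int s = block.indexOf(tag1, pos);
        if (s < 0) return false;
        int e = block.indexOf(tag2, s + tag1.length());
        if (e < 0) return false;
        block = inclusive ? block.substring(s, e + tag2.length())
                          : block.substring(s + tag1.length(), e);
        pos = 0;   // the old pointer is meaningless in the cropped block
        return true;
    }

    /** Remove tag1, tag2 and everything between them, from the pointer on. */
    boolean removeBetween(String tag1, String tag2) {
        int s = block.indexOf(tag1, pos);
        if (s < 0) return false;
        int e = block.indexOf(tag2, s + tag1.length());
        if (e < 0) return false;
        block = block.substring(0, s) + block.substring(e + tag2.length());
        return true;
    }

    /** Remove s after the pointer: every occurrence if global, else the next. */
    void remove(String s, boolean global) {
        String head = block.substring(0, pos);
        String tail = block.substring(pos);
        tail = global ? tail.replace(s, "")
                      : tail.replaceFirst(Pattern.quote(s), "");
        block = head + tail;
    }

    /** Strip all HTML tags (naive) and reset the pointer. */
    void stripTags() {
        block = block.replaceAll("<[^>]*>", "");
        pos = 0;
    }
}
```

A story page would then be handled as a short chain of calls: crop() to the
story markers, removeBetween() for the unwanted block, remove() for the
stray image reference, and findUrl("/print/") on the "fancy" page to locate
the plain version.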