Revision date: Jan. 2, 2007
HTMLDoc encapsulates an HTML document and includes a simple,
permissive parser that can handle most of the malformed, non-compliant HTML
documents found in the real world.
First, you have to load the HTML document using the load method. Once you
have the document, you can use both low-level and high-level methods to do what you want.
Through the getHTML method you can access individual tags by ID
(the "id" attribute) or by name (which does not mean a "name" attribute, but the type of tag
itself, for example "img"), and then do tag-related operations. High-level methods include toPureText.
To access an individual tag of your document, always call getHTML first
to get the HTMLFragment.
If you want to retrieve an HTML tag whose ID you know, just use the getTagById(String id)
method. The HTMLTag object returned lets you access the attributes with the
getAttributes method, and then you can retrieve or set any attribute.
Some HTML documents use absolute references to their inline content (inline images, sound, etc.),
which makes it difficult to display the page under a different domain (a typical case would
be the cache of a search engine). In that case, simply call relativizeEmbedded,
and the document no longer has this behaviour. Before using this method, make sure that the
BASE URL has been properly set.
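The idea of turning an absolute reference into one relative to the BASE URL can be illustrated with the JDK alone. The following is only a minimal sketch and not the implementation of relativizeEmbedded: the class RelativizeSketch and its relativize method are hypothetical names, and the sketch assumes the base URL ends with "/".

```java
import java.net.URI;

// Hypothetical helper, not part of HTMLDoc: shows how an absolute
// reference can be rewritten relative to a base URL using java.net.URI.
class RelativizeSketch {

    // Returns 'reference' expressed relative to 'baseUrl' when it lies
    // under the base; otherwise the reference is returned unchanged.
    // Assumption: 'baseUrl' ends with "/" (a directory-like URL).
    static String relativize(String baseUrl, String reference) {
        URI base = URI.create(baseUrl);
        URI target = base.resolve(reference);
        return base.relativize(target).toString();
    }

    public static void main(String[] args) {
        // Prints "img/logo.png"
        System.out.println(relativize("http://example.com/docs/",
                "http://example.com/docs/img/logo.png"));
        // A reference on another host cannot be relativized and is kept as is
        System.out.println(relativize("http://example.com/docs/",
                "http://other.example.org/a.png"));
    }
}
```

This also suggests why the BASE URL matters: without a correct base there is nothing to compute the relative path against.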
getMetaInfo lets you easily retrieve any document meta-information, though there
are also specialized methods like getHttpEquiv or getKeywords that
directly access specific pieces of meta-information.
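For readers unfamiliar with what "meta-information" covers, the sketch below collects the name/content and http-equiv/content pairs of meta tags from a string. It is an illustration only and makes no claim about how getMetaInfo works: MetaInfoSketch and metaInfo are hypothetical names, and the regular expression assumes double-quoted attributes with content appearing second, which the permissive HTMLDoc parser presumably does not require.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not part of HTMLDoc.
class MetaInfoSketch {

    // Matches <meta name="..." content="..."> and
    // <meta http-equiv="..." content="..."> (double-quoted values only).
    private static final Pattern META = Pattern.compile(
            "<meta\\s+(?:name|http-equiv)\\s*=\\s*\"([^\"]*)\""
            + "\\s+content\\s*=\\s*\"([^\"]*)\"[^>]*>",
            Pattern.CASE_INSENSITIVE);

    // Returns a map from lower-cased meta key to its content value,
    // in document order.
    static Map<String, String> metaInfo(String html) {
        Map<String, String> result = new LinkedHashMap<>();
        Matcher m = META.matcher(html);
        while (m.find()) {
            result.put(m.group(1).toLowerCase(Locale.ROOT), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<head><meta name=\"Keywords\" content=\"java, html\">"
                + "<meta http-equiv=\"Content-Type\" content=\"text/html\"></head>";
        System.out.println(metaInfo(html));
    }
}
```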
Example
The following example does some tag-level manipulations:
Reader re = ...
// Create the document
HTMLDoc doc = new HTMLDoc();
// Load its content
doc.load(re);
// Get the HTML
HTMLFragment html = doc.getHTML();
// Create a 'date' meta-tag
HTMLTag tag = HTMLTag.parse("<meta name=\"date\" content=\"21/01/2001\">");
// Insert it just before the title
html.insertBefore(html.findTagByName("title"), tag);
// Create a paragraph
tag = HTMLTag.create("p");
// Insert '<p>Paragraph</p>' just before a tag with id="someid"
html.insertBefore(html.getIdFinder("someid").getTag().getPosition(),
tag.toString("Paragraph"));
// Create an anchor to foo.html
HTMLTag anchor = HTMLTag.parse("<a href=\"foo.html\">");
// We could also do a 'HTMLTag.create("a")' and then set the 'href'
// attribute using getAttributes().setAttribute("href", "foo.html")
//
// Now we get a tag block with id="otherid"
tag = html.getIdFinder("otherid").getTagBlock();
// Replace the tag that has id="otherid" with the same tag
// wrapped in the foo.html anchor
html.replace(tag.getBlockPosition(), anchor.toString(tag));
// For example, if the 'otherid' tag was '<img id="otherid" src="something.jpg">',
// then the result would be:
// '<a href="foo.html"><img id="otherid" src="something.jpg"></a>'
//
tag = html.getTagByName("meta");
// We just got the first 'meta' tag found in the document, and now we
// set its name attribute to 'last_update', and its value
// (the 'content' attribute) to "20/01/2001"
tag.getAttributes().setAttribute("name", "last_update");
tag.getAttributes().setAttribute("content", "20/01/2001");
// Commit the changes to the 'meta' tag to the document
html.update(tag);
HTTPClient wraps a URLConnection object with HTTP-specific methods,
providing a lightweight HTTP client.
Example of its use to read the content of a URL:
HTTPClient client = new HTTPClient();
client.setUserAgent("My client/1.0");
URLConnection ucon = client.openConnection(url);
InputStream is = ucon.getInputStream();
String retcode = client.returnCode(ucon);
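For comparison, the same steps can be written against the plain JDK classes that such a wrapper builds on. This is a sketch under assumptions, not HTTPClient's implementation: PlainHttpRead, open and statusOf are hypothetical names, and the URL is supplied on the command line.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical JDK-only counterpart of the example above.
class PlainHttpRead {

    // Creates a connection object with a custom User-Agent header.
    // No network traffic happens until the connection is actually used.
    static URLConnection open(String url, String userAgent) throws IOException {
        URLConnection ucon = URI.create(url).toURL().openConnection();
        ucon.setRequestProperty("User-Agent", userAgent);
        return ucon;
    }

    // Returns the HTTP status code, or -1 for non-HTTP connections.
    static int statusOf(URLConnection ucon) throws IOException {
        return (ucon instanceof HttpURLConnection)
                ? ((HttpURLConnection) ucon).getResponseCode()
                : -1;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: PlainHttpRead <url>");
            return;
        }
        URLConnection ucon = open(args[0], "My client/1.0");
        // Read the whole body, then report the status code and body size
        try (InputStream is = ucon.getInputStream()) {
            byte[] body = is.readAllBytes();
            System.out.println(statusOf(ucon) + " " + body.length + " bytes");
        }
    }
}
```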
HTTP Form example:
HTTPClient client = new HTTPClient();
client.setUserAgent("My client/1.0");
HTTPForm form = client.createForm();
form.setParameter("name", "John");
form.addParameter("favorite_colors", "blue");
form.addParameter("favorite_colors", "yellow");
URLConnection ucon = form.post(url);
InputStream is = ucon.getInputStream();
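The form example adds two values under the same parameter name. A minimal sketch of how such parameters could end up in an application/x-www-form-urlencoded request body follows. FormBodySketch is a hypothetical class, and the semantics shown (setParameter replaces existing values, addParameter appends) are an assumption based on the method names, not confirmed behaviour of HTTPForm.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration of multi-valued form parameters.
class FormBodySketch {
    private final Map<String, List<String>> params = new LinkedHashMap<>();

    // Assumed semantics: replaces any values already stored under 'name'.
    void setParameter(String name, String value) {
        List<String> values = new ArrayList<>();
        values.add(value);
        params.put(name, values);
    }

    // Assumed semantics: appends another value under 'name'.
    void addParameter(String name, String value) {
        params.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
    }

    // Builds the URL-encoded body: one name=value pair per value, joined by '&'.
    String encode() {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, List<String>> e : params.entrySet()) {
            for (String value : e.getValue()) {
                body.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(value, StandardCharsets.UTF_8));
            }
        }
        return body.toString();
    }

    public static void main(String[] args) {
        FormBodySketch form = new FormBodySketch();
        form.setParameter("name", "John");
        form.addParameter("favorite_colors", "blue");
        form.addParameter("favorite_colors", "yellow");
        // Prints: name=John&favorite_colors=blue&favorite_colors=yellow
        System.out.println(form.encode());
    }
}
```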