PageParser - Content Parser
PageParser is a subclass of HTML::Parser that overrides the start, end and text functions so we can build the text-only version of the page.
An object of this class may have the following variables after parsing a page:
html - the text-only HTML output, if the page did not contain any frames
frame_html - frameset html
title - the title of the page as found from the <title> tag
redirect_url - the URL to redirect the browser to, if a meta redirect was found
redirect_delay - number of seconds to wait before the browser should redirect to the above URL
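As a rough illustration of where redirect_url and redirect_delay could come from, the sketch below splits the content attribute of a meta refresh tag into a delay and a URL. The helper name parse_meta_refresh and its regular expression are assumptions made for this example; they are not taken from PageParser itself.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PageParser): split the content attribute
# of <meta http-equiv="refresh" content="5; url=..."> into delay and URL.
sub parse_meta_refresh {
    my ($content) = @_;
    return unless defined $content;

    # The leading integer is the delay; an optional "url=" part may follow
    # a semicolon, with or without quotes around the URL.
    if ($content =~ /^\s*(\d+)\s*(?:;\s*url\s*=\s*['"]?([^'"\s]+)['"]?\s*)?$/i) {
        return { redirect_delay => $1, redirect_url => $2 };
    }
    return;
}

my $r = parse_meta_refresh('5; URL=http://www.example.com/next.html');
print "wait $r->{redirect_delay}s, then go to $r->{redirect_url}\n";
```

When the content holds only a number, redirect_url is left undefined in this sketch; how the real class treats that case is not stated here.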
    # setup the page parser
    my $p = PageParser->new;
    $p->ignore_elements(qw(script style));
    $p->{_show_alt_tags} = $cookie_man->{_settings}{show_alt_tags};
    $p->{_util} = $utils;
    $p->{_config} = $config;
    $p->{_target} = "h*tp://www.example.com/test.html";

    # parse some html that is fetched by a PageFetcher object
    $p->parse($fetcher->fetch_content($config));
The parse function is not overridden in this child class, but it is worth mentioning. When parse is called, the object starts parsing the passed content by calling our custom start, end and text functions. When the start of a tag is encountered, start is called. When an ending tag is encountered, end is called. When text is found that is not part of a tag, text is called. This allows us to extract all text regardless of what tag encloses it, and to keep only the specific tags that we use for the text-only version (e.g. the anchor tag).
returns nothing; results are stored in the variables described at the beginning.
example:
    # parse some html
    $p->parse($content);

    # this prints out the title of the parsed document
    print $p->{title};
start is invoked when the parser encounters the start of a new HTML tag. Here we pick and choose the HTML tags and attributes we wish to keep. If a tag is kept, it is pushed onto a stack so that the proper end tag can be identified when it is found.
returns nothing.
example: used internally.
end is invoked when the parser encounters a closing HTML tag. It decides whether the tag needs to be appended to the parsed content by checking the tag stack for a start tag of the same type.
returns nothing.
example: used internally.
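The stack bookkeeping shared by start and end can be sketched roughly as follows. This is a stand-alone illustration that calls its handlers by hand rather than through HTML::Parser. The whitelist %KEEP, the names on_start and on_end, and the choice to discard any kept tags still open above the match are all assumptions, not PageParser's actual rules.

```perl
use strict;
use warnings;

# Hypothetical whitelist: tag name => attributes worth keeping.
my %KEEP = (a => ['href'], p => [], b => []);

my @stack;       # kept tags that are still open
my $out = '';    # text-only output built so far

# Plays the role of start: keep whitelisted tags and remember them.
sub on_start {
    my ($tag, $attr) = @_;
    return unless $KEEP{$tag};
    my $attrs = join '',
        map  { sprintf ' %s="%s"', $_, $attr->{$_} }
        grep { exists $attr->{$_} } @{ $KEEP{$tag} };
    $out .= "<$tag$attrs>";
    push @stack, $tag;
    return;
}

# Plays the role of end: emit a closing tag only if a matching
# start tag was kept and is still open.
sub on_end {
    my ($tag) = @_;
    my ($idx) = grep { $stack[$_] eq $tag } reverse 0 .. $#stack;
    return unless defined $idx;
    splice @stack, $idx;    # drops the match and anything opened after it
    $out .= "</$tag>";
    return;
}

on_start('a',    { href => '/home', onclick => 'x()' });
on_start('font', { color => 'red' });    # not whitelisted, dropped
on_end('font');                          # no kept start tag, dropped
on_end('a');
print "$out\n";    # <a href="/home"></a>
```

Searching the stack from the top means a stray closing tag with no kept start tag is silently ignored instead of unbalancing the output.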
text is invoked when the parser encounters text in the document. The text is usually appended to the output if it contains something other than whitespace.
returns nothing.
example: used internally.
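The whitespace check described for text might look like the minimal sketch below. The name on_text and the $out buffer are placeholders for this example. The real handler may apply further conditions, since the description only says the text is "usually" appended.

```perl
use strict;
use warnings;

my $out = '';    # text-only output built so far

# Plays the role of text: append only when the chunk contains
# at least one non-whitespace character.
sub on_text {
    my ($text) = @_;
    return unless $text =~ /\S/;
    $out .= $text;
    return;
}

on_text("\n   \t");    # whitespace only, skipped
on_text('Hello, ');
on_text('world');
print "$out\n";        # Hello, world
```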
combine_attr is a convenience function for combining attributes of HTML tags. The first parameter is a hash reference to key/value pairs for all known attributes of an HTML tag. The second parameter is an array reference of the desired attributes to combine. If a desired attribute does not exist, it is left out. The output is a string of the desired attributes in HTML form, for example:
    key1="value1" key2="value2" key3="value3"
returns a string with the desired attributes in html form.
example:
$str = $self->combine_attr( {key1 => 'value1', color => 'red'}, ['color', 'food']); # $str will contain color="red"
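For readers who want to see the behaviour end to end, here is a minimal stand-alone sketch of such a helper. It is written as a plain function named combine_attr_sketch (a hypothetical name) rather than a method. It assumes that attributes are emitted in the order of the desired list and that values are not HTML-escaped, neither of which is stated in the description above.

```perl
use strict;
use warnings;

# Hypothetical stand-alone version of the helper described above.
# $attr   : hash reference of all known attributes of a tag
# $wanted : array reference of the attribute names to keep
sub combine_attr_sketch {
    my ($attr, $wanted) = @_;
    return join ' ',
        map  { sprintf '%s="%s"', $_, $attr->{$_} }
        grep { exists $attr->{$_} } @$wanted;
}

my $str = combine_attr_sketch(
    { key1 => 'value1', color => 'red', food => 'pie' },
    ['color', 'food', 'size'],
);
print "$str\n";    # color="red" food="pie"
```

Because missing attributes are filtered out before joining, asking only for attributes that do not exist yields an empty string.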
See Luci Documentation for more information.