NAME

PageParser - Content Parser


SYNOPSIS

PageParser is a subclass of HTML::Parser that overrides the start, end and text functions so we can build the text-only version of a page.

An object of this class may have the following variables after parsing a page:

html - the text-only HTML output, set if the page did not contain any frames

frame_html - frameset html

title - the title of the page as found from the <title> tag

redirect_url - the url to redirect the browser to, if a meta redirect was found

redirect_delay - number of seconds to wait before the browser should redirect to the above url
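
As a rough illustration of where redirect_url and redirect_delay could come from, the sketch below splits the content attribute of a meta refresh tag into a delay and a url. The helper name parse_meta_refresh and the sample url are hypothetical; PageParser itself may extract these values differently.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of PageParser): split the content
# attribute of <meta http-equiv="refresh" content="5; url=...">
# into a delay in seconds and a target url.
sub parse_meta_refresh {
    my ($content) = @_;
    return unless defined $content;
    # a number of seconds, then an optional "; url=..." part
    if ($content =~ /^\s*(\d+)\s*(?:;\s*url\s*=\s*(\S+))?\s*$/i) {
        my ($delay, $url) = ($1, $2);
        $url =~ s/^['"]|['"]$//g if defined $url;   # strip optional quotes
        return ($delay, $url);
    }
    return;   # not a recognisable refresh value
}

my ($delay, $url) = parse_meta_refresh('5; url=http://www.example.com/next.html');
print "redirect_delay=$delay redirect_url=$url\n";
```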


USAGE

 # setup the page parser
 my $p = PageParser->new;
 $p->ignore_elements(qw(script style));
 $p->{_show_alt_tags} = $cookie_man->{_settings}{show_alt_tags};
 $p->{_util} = $utils;
 $p->{_config} = $config;
 $p->{_target} = "h*tp://www.example.com/test.html";
 # parse some html that is fetched by a PageFetcher object
 $p->parse($fetcher->fetch_content($config));


DESCRIPTION

PageParser extends the parent class HTML::Parser. See the HTML::Parser documentation on CPAN for more information.

parse

This function is not overridden in this child class, but it is worth mentioning. When parse is called, the object parses the passed content by calling our custom start, end and text functions. When the start of a tag is encountered, start is called. When an ending tag is encountered, end is called. When text is found that is not part of a tag, text is called. This allows us to extract all text regardless of which tag encloses it, and to extract only the specific tags that we use for the text-only version (e.g. the anchor tag).

returns nothing. Results are stored in the object variables listed in the SYNOPSIS.

example:

 # parse some html
 $p->parse($content);
 # this prints out the title of the parsed document
 print $p->{title};
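
The dispatch described above can be observed with a small stand-alone subclass. This is only a sketch: EventLogger and its events field are hypothetical names, not part of PageParser, and it assumes the HTML::Parser module from CPAN is installed. Calling new without arguments makes HTML::Parser invoke the start, end and text methods of the subclass.

```perl
package EventLogger;   # hypothetical demo class, not PageParser itself
use strict;
use warnings;
use parent 'HTML::Parser';   # CPAN module, loaded by "use parent"

# Each callback records its name and the tag or text it was given.
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    push @{ $self->{events} }, "start:$tag";
}

sub end {
    my ($self, $tag, $origtext) = @_;
    push @{ $self->{events} }, "end:$tag";
}

sub text {
    my ($self, $text, $is_cdata) = @_;
    push @{ $self->{events} }, "text:$text" if $text =~ /\S/;
}

package main;

my $p = EventLogger->new;    # no arguments: method callbacks are used
$p->parse('<p>Hello <a href="/x">world</a></p>');
$p->eof;                     # signal end of input so buffered text is flushed
print "$_\n" for @{ $p->{events} };
```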

start

Invoked when the parser encounters the start of a new HTML tag. Here we pick and choose the HTML tags and attributes we wish to keep. A kept tag is pushed onto a stack so that its matching end tag can be identified when it is found.

returns nothing.

example: used internally.

end

Invoked when the parser encounters a closing HTML tag. The end tag is appended to the parsed content only if the tag stack holds a start tag of the same type.

returns nothing.

example: used internally.
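
The interplay between start and end can be sketched without HTML::Parser by calling the two methods directly. Everything here is an assumption made for illustration: the class name StackSketch, the list of kept tags, and the stack and out fields are not taken from PageParser.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the start/end logic; the kept-tag list and
# the field names (stack, out) are assumptions, not PageParser's own.
package StackSketch;

my %KEEP = map { $_ => 1 } qw(a p);   # assumed set of tags worth keeping

sub new { return bless { stack => [], out => '' }, shift }

sub start {
    my ($self, $tag) = @_;
    return unless $KEEP{$tag};         # drop tags we do not want
    push @{ $self->{stack} }, $tag;    # remember it for the matching end tag
    $self->{out} .= "<$tag>";
    return;
}

sub end {
    my ($self, $tag) = @_;
    my $stack = $self->{stack};
    # emit the end tag only if a start tag of the same type was kept
    return unless grep { $_ eq $tag } @$stack;
    # discard any unclosed inner tags, then the matching start tag
    pop @$stack while @$stack && $stack->[-1] ne $tag;
    pop @$stack;
    $self->{out} .= "</$tag>";
    return;
}

package main;

my $s = StackSketch->new;
$s->start('p');
$s->start('font');   # not kept, so its end tag is dropped as well
$s->end('font');
$s->end('p');
print $s->{out}, "\n";   # prints <p></p>
```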

text

Invoked when the parser encounters text in the document. This is usually appended to the output if it contains something other than whitespace.

returns nothing.

example: used internally.
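
The whitespace check can be pictured as follows. This is a minimal sketch with a hypothetical helper, append_text; the real text function is a method and may apply further rules before appending.

```perl
use strict;
use warnings;

# Hypothetical helper: append text to the output only when it
# contains at least one non-whitespace character.
sub append_text {
    my ($out_ref, $text) = @_;
    return unless defined $text && $text =~ /\S/;
    $$out_ref .= $text;
    return;
}

my $out = '';
append_text(\$out, "  \n\t");        # whitespace only: skipped
append_text(\$out, 'Hello world');   # kept
print "$out\n";                      # prints Hello world
```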

combine_attr

A convenience function for combining attributes of HTML tags. The first parameter is a hash reference to key/value pairs for all known attributes of an HTML tag. The second parameter is an array reference of the desired attributes to combine. If a desired attribute does not exist, it is left out. The output is a string of the desired attributes in HTML form, for example: key1="value1" key2="value2" key3="value3"

returns a string with the desired attributes in html form.

example:

 $str = $self->combine_attr( {key1 => 'value1', color => 'red'},
                             ['color', 'food']);
 # $str will contain color="red"
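
For readers who want to see the behaviour in isolation, here is one way such a function could be written. It is a sketch under assumptions: it is a plain function rather than a method taking $self, it emits attributes in the order of the second parameter, and it does no escaping of values. The real combine_attr may differ on all three points.

```perl
use strict;
use warnings;

# Sketch of combine_attr as a plain function (the real one is a method).
# $attr   - hash reference of all known attributes of a tag
# $wanted - array reference of the attribute names to keep
sub combine_attr {
    my ($attr, $wanted) = @_;
    return join ' ',
        map  { sprintf '%s="%s"', $_, $attr->{$_} }   # key="value"
        grep { exists $attr->{$_} } @$wanted;         # skip missing attributes
}

my $str = combine_attr({ key1 => 'value1', color => 'red' },
                       ['color', 'food']);
print "$str\n";   # prints color="red"
```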


LUCI DOCUMENTATION

See Luci Documentation for more information.


AUTHORS