Jan 4, 2010

Extracting text from HTML body

Do you know how to extract text from the HTML <body> tag with one regular expression? Here it is:
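function extractBody($htmlContent) {
    // Capture everything between the opening <body ...> tag and the closing </body> tag.
    $regExp = '/.*<body[^>]*>(.*)<\/body>.*/is';
    // A single preg_match() call both tests the pattern and fills $matches with the captured group.
    if (preg_match($regExp, $htmlContent, $matches)) {
        return trim($matches[1]);
    }
    // No <body> element found.
    return '';
}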
Note that you have to use the "i" and "s" modifiers; otherwise this regular expression will not work in all cases. The "i" modifier makes matching case-insensitive, so the <body> tag is detected in mixed or upper case. The "s" modifier makes the dot (.) match newline characters as well, so the expression still works when the body spans several lines.
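To see what each modifier contributes, here is a small sketch that runs the core of the pattern against the same input three times. The sample HTML string and the variable names are made up for illustration; only preg_match() and the modifiers themselves come from PHP.

```php
<?php
// Hypothetical sample input: upper-case tags and a body that spans several lines.
$html = "<HTML><BODY class=\"main\">\nHello,\nworld!\n</BODY></HTML>";

// Core of the pattern, without delimiters or modifiers.
$pattern = '<body[^>]*>(.*)<\/body>';

// No modifiers: the lower-case pattern does not match the upper-case tags.
$plain = preg_match('/' . $pattern . '/', $html);

// "i" only: the tags match, but "." stops at the first newline,
// so the multi-line body cannot be captured and the match fails.
$caseOnly = preg_match('/' . $pattern . '/i', $html);

// "i" and "s": tags match in any case and "." also matches newlines.
$both = preg_match('/' . $pattern . '/is', $html, $matches);
$body = $both ? trim($matches[1]) : '';

var_dump($plain, $caseOnly, $both, $body);
```

Only the last call finds the body, which is why both modifiers are needed for real-world pages.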

2 comments:

  1. Dmitry -

    I just used this on my Mac:

    lynx -dump example.html >example.txt

    worked fine ;-)

  2. Jens, I do not think you will want to launch Lynx from PHP for this :D
