Jan 4, 2010

Extracting text from HTML body

Do you know how to extract text from the HTML <body> tag with one regular expression? Here it is:
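function extractBody($htmlContent) {
    $result = '';
    // "i": case-insensitive match; "s": dot also matches newlines
    $regExp = '/.*<body[^>]*>(.*)<\/body>.*/is';
    if (preg_match($regExp, $htmlContent)) {
        // Keep only the captured group: everything between <body ...> and </body>
        $result = trim(preg_replace($regExp, '\1', $htmlContent));
    }
    return $result;
}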
Note that you have to use the "i" and "s" modifiers; otherwise this regular expression will not work in all cases. The "i" modifier makes the match case-insensitive, so the <body> tag is detected in mixed or upper case. The "s" modifier makes the dot match newline characters, so the expression still works when the body spans several lines.
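
To see what each modifier contributes, here is a minimal sketch that runs the same pattern against a small document with an upper-case, multi-line <BODY>. The helper name extractBodyDemo and the sample markup are made up for illustration; the sketch reads the capture group directly from preg_match instead of calling preg_replace, which should give the same result for this input.

```php
<?php
// Hypothetical helper for illustration: same idea as extractBody() above,
// but the captured group is read from preg_match's third argument.
function extractBodyDemo(string $html): string {
    if (preg_match('/<body[^>]*>(.*)<\/body>/is', $html, $matches)) {
        return trim($matches[1]);
    }
    return '';
}

// Sample input (an assumption): upper-case tags, body spread over several lines.
$html = "<HTML>\n<BODY class=\"main\">\n  <p>Hello</p>\n</BODY>\n</HTML>";

// With both modifiers the body content is found.
var_dump(extractBodyDemo($html));

// Without "i", <BODY> in upper case is not recognised.
$withoutI = preg_match('/<body[^>]*>(.*)<\/body>/s', $html);
var_dump($withoutI);

// Without "s", the dot stops at the first newline inside the body.
$withoutS = preg_match('/<body[^>]*>(.*)<\/body>/i', $html);
var_dump($withoutS);
```

Note that the result still contains the markup inside the body (here the <p> element). If plain text is wanted, passing the result through PHP's strip_tags() is one possible follow-up step.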


  1. Dmitry -

    I just used this on my Mac:

    lynx -dump example.html >example.txt

    worked fine ;-)

  2. Jens, I do not think you will want to launch Lynx from PHP for this :D