Do you know how to extract the contents of the HTML <body> tag with a single regular expression? Here it is:
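function extractBody($htmlContent) {
    // Group 1 captures everything between the opening and closing body tags.
    $regExp = '/.*<body[^>]*>(.*)<\/body>.*/is';
    // preg_match() fills $matches, so a separate preg_replace() call is not needed.
    if (preg_match($regExp, $htmlContent, $matches)) {
        return trim($matches[1]);
    }
    // No <body> element found.
    return '';
}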
Note that you have to use the "i" and "s" modifiers, otherwise this regular expression will not work in all cases. The "i" modifier makes the match case-insensitive, so the <body> tag is detected even in mixed or upper case. The "s" modifier makes the dot match newline characters too, so the expression still works when the body spans several lines.
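To see what the two modifiers actually change, here is a small sketch. The helper name extractBodyWith and the sample markup are made up for this illustration; the pattern is the same one as above, only the modifiers are passed in so they can be compared.

```php
<?php
// Hypothetical helper for illustration only: same pattern as in the post,
// but with the modifiers supplied by the caller.
function extractBodyWith($htmlContent, $modifiers) {
    $regExp = '/.*<body[^>]*>(.*)<\/body>.*/' . $modifiers;
    if (preg_match($regExp, $htmlContent, $matches)) {
        return trim($matches[1]);
    }
    return '';
}

// Sample input (an assumption): upper-case tags and a multi-line body.
$html = "<HTML>\n<BODY class=\"main\">\nHello,\nworld!\n</BODY>\n</HTML>";

var_dump(extractBodyWith($html, 'is')); // both modifiers: the body text is extracted
var_dump(extractBodyWith($html, 's'));  // no "i": <BODY> is upper case, nothing matches
var_dump(extractBodyWith($html, 'i'));  // no "s": "." stops at the newline, nothing matches
```

With this input only the first call returns the body text; the other two return an empty string.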
Dmitry -
I just used this on my Mac:
lynx -dump example.html >example.txt
worked fine ;-)
Jens, I do not think you will want to launch Lynx from PHP for this :D