Extracting strings from HTML code with PHP

This expression only captures the value between the angle brackets > < when it is purely numeric. I want to capture it whatever the value is. This is my HTML code:

<a class="producto" href="ver.asp?id=4013">A86028</a></span><!-- /a --></td></tr> 
    <a class="producto" href="ver.asp?id=4014">1027C</a></span><!-- /a --></td></tr> 
    <a class="producto" href="ver.asp?id=4014">5611 4020</a></span> 
<!-- /a --></td></tr> 
    <a class="producto" href="ver.asp?id=4014">396-4185</a></span> 
<!-- /a --></td></tr> 
    <a class="producto" href="ver.asp?id=4014">834006-5-7</a></span> 
<!-- /a --></td></tr> 
    <a class="producto" href="ver.asp?id=4014">5601GR 4325GR</a></span> 
<!-- /a --></td></tr> 
    <a class="producto" href="ver.asp?id=4014">2182CR(2)</a></span> 
<!-- /a --></td></tr> 
    <a class="producto" href="ver.asp?id=4014">1458-54-63-55</a></span> 
<!-- /a --></td></tr>

This is my current function:

function GetProducts($file){ 
    $regex = "|class=\"producto\"[^>]+>([0-9]*)</[^>]+>|U"; 
    if(!is_file($file)) return false; 
    preg_match_all($regex,file_get_contents($file), $result); 
    foreach($result[1] as $key =>$value) $result[$key] = (int) $value; 
    return $result; 
}

My desired output is:

Array ([1] => 1027 [2] => 5611 [3] => 5396 [4] => 834006 [5] => 5601 [6] => 2182 [7] => 1458)

Source

2014-09-11 Javier Sega

[Don't parse HTML with regex!](http://stackoverflow.com/a/1732454/418066) – Biffen

What is your desired output? –

Array ([1] => 1027 [2] => 5611 [3] => 5396 [4] => 834006 [5] => 5601 [6] => 2182 [7] => 1458) –

You can use a regex like this:

([\w\s()-]+)</

The idea is to capture the alphanumeric characters, whitespace, dashes and parentheses that come immediately before the closing "</".
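As a rough illustration of that idea, the sketch below runs the pattern over a few rows copied from the question. Two details are assumptions rather than part of the answer: the hyphen is placed at the end of the character class, since newer PCRE versions can reject a hyphen that directly follows \s, and "~" is chosen as the delimiter because the pattern contains "/".

```php
<?php
// Sample rows copied from the question's HTML.
$html = <<<'HTML'
<a class="producto" href="ver.asp?id=4013">A86028</a></span><!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">1027C</a></span><!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">5611 4020</a></span>
<!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">2182CR(2)</a></span>
<!-- /a --></td></tr>
HTML;

// The hyphen sits at the end of the class so it is always read as a literal.
// "~" is the delimiter because the pattern itself contains "/".
preg_match_all('~([\w\s()-]+)</~', $html, $matches);

// $matches[1] holds the text captured before each "</".
print_r($matches[1]);
```

On this sample the captures are the four link texts (A86028, 1027C, 5611 4020, 2182CR(2)). The pattern does not look at class="producto" at all, so on a larger page it would also pick up text before any other closing tag.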

Source

2014-09-11 21:06:00

This might work, but as people say, parsing HTML with regex is problematic.

# class="producto"[^>]+>([^<]*)</[^>]+> 

class="producto" [^>]+ > 
([^<]*) 
</ [^>]+ >
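One way this pattern could be plugged into PHP to get the leading numbers the question asks for is sketched below. The helper name getProductNumbers, the second pass that keeps only the leading digits, and the choice to take an HTML string instead of a file path are all assumptions made for illustration; none of them come from the answer itself.

```php
<?php
// Hypothetical helper (not from the thread): returns the leading digits of each
// class="producto" link text, skipping texts that do not start with a digit.
function getProductNumbers(string $html): array
{
    // Same pattern as above, wrapped in "|" delimiters like the question's code.
    $regex = '|class="producto"[^>]+>([^<]*)</[^>]+>|';
    preg_match_all($regex, $html, $found);

    $numbers = [];
    foreach ($found[1] as $text) {
        // Keep only the run of digits at the start of the text, if any.
        if (preg_match('/^\d+/', $text, $m)) {
            $numbers[] = (int) $m[0];
        }
    }
    return $numbers;
}

// Three rows copied from the question's HTML.
$html = <<<'HTML'
<a class="producto" href="ver.asp?id=4013">A86028</a></span><!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">1027C</a></span><!-- /a --></td></tr>
    <a class="producto" href="ver.asp?id=4014">396-4185</a></span>
<!-- /a --></td></tr>
HTML;

print_r(getProductNumbers($html));
```

For these three rows the helper returns 1027 and 396; A86028 is skipped because it does not start with a digit.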

Source

2014-09-11 20:37:14 sln

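The very post that warns against parsing HTML with regex has an answer that says: asking regular expressions to parse arbitrary HTML is like asking Paris Hilton to write an operating system, but it is sometimes appropriate to parse a **limited, known** set of HTML. And that is exactly the case here. – LSerni

Yes, I could throw a 15k regex at parsing HTML and it would still have problems, especially with entities and substitutions. I rationalize that this case falls into a known set of HTML. – sln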

You asked for a pure regular expression here, but regex is not the right tool for parsing HTML. A DOM parser is a better fit:

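// Reducer: append the leading run of digits in $str to $m, if there is one.
function _matcher($m, $str) {
    if (preg_match('/^\d+/', $str, $matches)) {
        $m[] = $matches[0];
    }
    return $m;
}

// $html is expected to hold the markup from the question.
libxml_use_internal_errors(true); // keep warnings about the stray closing tags quiet
$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);

// Collect the text of every <a class="producto"> element.
$vals = array();
foreach ($xpath->query('//a[@class="producto"]') as $link) {
    $vals[] = $link->nodeValue;
}

print_r(array_reduce($vals, '_matcher', array()));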

Output:

Array 
(
    [0] => 1027 
    [1] => 5611 
    [2] => 396 
    [3] => 834006 
    [4] => 5601 
    [5] => 2182 
    [6] => 1458 
)

Source

2014-09-11 21:37:28 hwnd
