Collect the 부가정보 (additional information) of bills #39

e9t · 2015-12-03T10:41:06Z

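The bill crawler currently does not collect the "부가정보" (additional information) section of a bill. For alternative bills (대안), this section lists the related bills, so we are missing very important information. To collect this data, modify the file that parses the HTML into JSON.

Current code: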

for i, r in enumerate(elem_row_contents):
    if row_titles[i]!='부가정보':  # handle every area (row) other than "부가정보"
        status_dict[row_titles[i]] = extract_row_contents(r)
    else:  # handle the "부가정보" area
        t = r.xpath('span[@class="text8"]/text()')
        c = filter(None, (t.strip() for t in r.xpath('text()')))
        status_dict[row_titles[i]] = dict(zip(t, c))

Improvement: the xpath in the part of the snippet above that handles the "부가정보" area does not seem to work correctly. Debugging it is probably not very hard, but it requires going through the HTML file carefully again.

If you are not familiar with xpath, please see the following link: http://www.slideshare.net/lucypark/the-beginners-guide-to-54279917/49


mijungk · 2015-12-10T15:33:58Z

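Hello. I tried debugging this, but the problem turned out to be less simple than expected, and with my limited experience I could not solve it. I am sharing what I found in case it helps.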

Currently,

t = r.xpath('span[@class="text8"]/text()')

returns nothing; the class needs to be changed to class="text11".

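Actually, the bigger problem is the dict(zip(t, c)) part.

c = filter(None, (t.strip() for t in r.xpath('text()')))

returns the contents of "비고" (remarks) and "대안반영폐기 의안목록" (list of bills discarded in favor of the alternative) together as one flat list, so t and c are not mapped correctly. At the moment dict(zip(t, c)) produces a mapping like {비고: remark item 1 (Article 85-3(5) of the National Assembly Act....), 대안반영폐기 의안목록: remark item 2 (Article 85-3(2) of the National Assembly Act...)}. The two items under "비고" are written in the HTML as plain text without tags, so they are hard to parse with xpath.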
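The mismatch described above can be reproduced without any HTML at all. The following is a minimal sketch with made-up values; the two lists only imitate the shape of what the two xpath calls return, and the remark strings are placeholders, not real page content.

```python
# Hypothetical values shaped like the results of the two xpath calls.
titles = ["비고", "대안반영폐기 의안목록"]      # t: section titles taken from the <span> elements
texts = ["remark line 1", "remark line 2"]       # c: every bare text node in the cell, flattened

# zip pairs items purely by position, so the second remark line
# ends up under the second title even though it belongs to "비고".
mapped = dict(zip(titles, texts))

for title, value in mapped.items():
    print(title, "->", value)
```

Because zip has no notion of which title a text node sits under, any section whose content is not exactly one text node shifts every later pairing.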

e9t · 2015-12-11T05:01:51Z

@mjkim720 Thank you! The information you shared will be a big help :)

mithrandir · 2015-12-31T12:41:49Z

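Looking at the items that appear in 부가정보, there are roughly:
비고 (remarks): text
대안반영 폐기목록 (bills discarded in favor of the alternative): links
대안 (alternative bill): link

Since we cannot know what else might be added to 부가정보, I think we should parse 비고, 대안반영 폐기목록, and 대안 individually, and for any other item either log it or store it as an unknown value.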

It would be good to create one parsing function for each: extract_remark, extract_revoked_bills_by_alternative, and extract_alternative_bill.

As for what each one returns, how about the following?
extract_remark = [text of each line]
extract_revoked_bills_by_alternative = [bill_id]
extract_alternative_bill = [bill_id] or bill_id
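One way the proposal above could be wired together is sketched below. The three function names come from the comment; everything else is an assumption made for illustration: the input shape (a dict from section title to a list of already-extracted items), the title strings used as keys, the "unknown" key, and the choice to always return a list from extract_alternative_bill.

```python
import logging

logger = logging.getLogger(__name__)


def extract_remark(lines):
    """Return the non-empty remark lines, stripped of whitespace."""
    return [line.strip() for line in lines if line.strip()]


def extract_revoked_bills_by_alternative(bill_ids):
    """Return the bill ids of bills discarded in favor of the alternative."""
    return list(bill_ids)


def extract_alternative_bill(bill_ids):
    """Return the alternative bill ids (always a list in this sketch)."""
    return list(bill_ids)


# Assumed section titles; the real page may spell them differently.
PARSERS = {
    "비고": extract_remark,
    "대안반영폐기 의안목록": extract_revoked_bills_by_alternative,
    "대안": extract_alternative_bill,
}


def parse_additional_info(sections):
    """Dispatch each section to its parser; log and keep unknown sections."""
    result = {}
    for title, items in sections.items():
        parser = PARSERS.get(title)
        if parser is None:
            logger.warning("unknown 부가정보 section: %s", title)
            result.setdefault("unknown", {})[title] = list(items)
        else:
            result[title] = parser(items)
    return result
```

Keeping unrecognized sections under a separate key means that a new kind of entry on the page is preserved and shows up in the logs instead of being silently dropped.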

mithrandir · 2015-12-31T12:46:30Z

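Looking at the HTML again, it is laid out as title, content, title, content and so on, which makes it awkward to process. I need to look for a way to extract the region between particular xpath matches. It will be hard to do with text() alone.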
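One possible way to recover the title/content grouping is to walk the children of the cell in document order and read the text that follows each element from its .tail attribute. The sketch below uses the standard library's xml.etree.ElementTree so that it runs without extra dependencies; the crawler itself uses lxml, whose elements expose the same .tag, .text, .tail and .get(). The sample markup, the href format, and the function name split_sections are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup imitating the title/content layout.
SAMPLE = (
    '<td>'
    '<span class="text11">비고</span><br/>'
    'remark line 1<br/>'
    'remark line 2<br/>'
    '<span class="text11">대안반영폐기 의안목록</span><br/>'
    '<a href="detail?bill_id=PRC_EXAMPLE1">bill A</a><br/>'
    '</td>'
)


def split_sections(cell):
    """Group bare text and link hrefs under the title <span> that precedes them."""
    sections = {}
    current = None
    for child in cell:
        if child.tag == "span" and child.get("class") == "text11":
            # A title span starts a new section.
            current = {"texts": [], "links": []}
            sections[(child.text or "").strip()] = current
        elif current is not None and child.tag == "a":
            current["links"].append(child.get("href"))
        # Bare text that follows an element (e.g. after <br/>) is stored in .tail.
        tail = (child.tail or "").strip()
        if current is not None and tail:
            current["texts"].append(tail)
    return sections


sections = split_sections(ET.fromstring(SAMPLE))
```

With the sample above, both remark lines land under "비고" and the link lands under "대안반영폐기 의안목록". Extracting a bill_id from each href is left out because the real link format is not shown in this thread.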


e9t added the bug label Dec 3, 2015

mithrandir added a commit to mithrandir/crawlers that referenced this issue Dec 31, 2015

Parse '부가정보' part correctly

94e6c08

* 비고: use br.tail or any.text
* 대안, 대안반영폐기 목록: use bill_id extracted from link_id from href

this commit fixes issue teampopong#39

mithrandir mentioned this issue Dec 31, 2015

Parse '부가정보' part correctly #41

Open
