Retrieve plenum protocols (Task #1428)

Added by אדם קריב almost 4 years ago. Updated over 1 year ago.

Status:	New	Start date:	10/26/2010
Priority:	Normal	Due Date:
Assignee:	Ori Hoch	% Done:	0%
Category:	Data harvesting	Spent time:	4.00 hours
Target version:	-
Suitable as a first task:	No

Description

Similar to what is done today for each committee.

parse.py - Initial script (7.8 kB) Ofir Carny, 12/29/2012 05:35 pm

18_ptm_219371.doc.awdb.xml (459.9 kB) Ori Hoch, 01/07/2013 10:08 pm

03581012.doc.awdb.xml (414.8 kB) Ori Hoch, 01/07/2013 10:08 pm

04085512.doc.awdb.xml (249.7 kB) Ori Hoch, 01/07/2013 10:08 pm

plenum_htmls.zip (2.9 MB) Ori Hoch, 01/13/2013 03:38 pm

History

#1
Updated by Ori Hoch over 1 year ago

Need to get the protocols from this URL:
http://www.knesset.gov.il/plenum/heb/plenum_search.aspx
and then parse the .doc files.

Assignee set to Ori Hoch
Category set to Data harvesting
Suitable as a first task set to No

#2
Updated by Ofir Carny over 1 year ago

The attached script contains Python retrieval code, OpenOffice-based parsing and HTML conversion, plus some post-conversion HTML parsing and Hebrew Unicode handling.

File parse.py added

#3
Updated by Ori Hoch over 1 year ago

Attached: sample XML files produced using antiword.

Wiki page for planning the parsing process of these XML files:
https://github.com/astupidog/Open-Knesset/wiki/parse-plenum-protocols
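
A minimal sketch of how such XML files could be produced from the downloaded .doc files. It assumes antiword is installed and uses its DocBook output mode (-x db) with the UTF-8 mapping file; the exact flags used for the attached samples are not stated here, and the helper names (antiword_command, output_path_for, doc_to_docbook) are hypothetical. The output name follows the pattern of the attachments (original file name plus ".awdb.xml").

```python
import subprocess
from pathlib import Path


def antiword_command(doc_path, mapping="UTF-8.txt"):
    """Build the antiword command line for DocBook XML output.

    Assumption: "-x db" selects DocBook XML and "-m UTF-8.txt" selects
    the UTF-8 character mapping shipped with antiword.
    """
    return ["antiword", "-x", "db", "-m", mapping, str(doc_path)]


def output_path_for(doc_path):
    """Name the output like the attached samples: <name>.doc.awdb.xml."""
    doc_path = Path(doc_path)
    return doc_path.with_name(doc_path.name + ".awdb.xml")


def doc_to_docbook(doc_path):
    """Run antiword on one .doc file and save its XML output next to it."""
    result = subprocess.run(
        antiword_command(doc_path), check=True, capture_output=True
    )
    out_path = output_path_for(doc_path)
    out_path.write_bytes(result.stdout)
    return out_path
```

Calling doc_to_docbook("03581012.doc") would write 03581012.doc.awdb.xml in the same directory, provided antiword is on the PATH.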

File 03581012.doc.awdb.xml added
File 04085512.doc.awdb.xml added
File 18_ptm_219371.doc.awdb.xml added

#4
Updated by Ori Hoch over 1 year ago

The parsing process is finished and working, but there are probably many bugs.

Attached are the parsed HTML files; hopefully someone will look them over and file bug reports.

Extract plenum_htmls.zip somewhere, then open index.html in a web browser (tested with Chrome, but it should work in other browsers as well).

File plenum_htmls.zip added

#5
Updated by Ori Hoch over 1 year ago

Points going forward:
1. For the database models, we will use the same models and tables as for committees, adding a "type" field to each committee that marks it as either a "committee" or a "plenum committee". There will be only one "plenum committee".
2. Add a condition to the create_protocol_parts function so that it checks whether this is the "plenum committee" and, if so, performs different parsing.
3. All parts will be processed into protocolParts, with an added type field: "speaker", "title", "text without a speaker", etc.
4. The protocol_text field will store the clean text of the protocol, so that we can re-parse without downloading the protocol again from the Knesset website.
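
The four points above can be sketched roughly as follows. This is only an illustration using plain dataclasses rather than the project's real database models; apart from "type", create_protocol_parts and protocol_text, every name here is hypothetical, and the parsing rules (a line ending in ":" starts a speaker, a line starting with "# " is a plenum title) are placeholder assumptions, not the actual protocol format.

```python
from dataclasses import dataclass, field
from typing import List

COMMITTEE, PLENUM = "committee", "plenum"
PART_SPEAKER, PART_TITLE, PART_TEXT = "speaker", "title", "text"


@dataclass
class ProtocolPart:
    order: int
    part_type: str  # "speaker", "title" or "text" (text without a speaker)
    header: str     # speaker name or title text
    body: str


@dataclass
class Committee:
    name: str
    type: str = COMMITTEE  # point 1: marks a regular or plenum committee


@dataclass
class Meeting:
    committee: Committee
    protocol_text: str  # point 4: clean text kept so we can re-parse offline
    parts: List[ProtocolPart] = field(default_factory=list)


def _parse(text, detect_titles):
    """Split clean protocol text into typed parts (toy rules, see lead-in)."""
    parts = []
    header, body = None, []  # header None means text without a speaker

    def flush():
        # Store the part collected so far, if there is one.
        if header is not None or body:
            kind = PART_SPEAKER if header is not None else PART_TEXT
            parts.append(
                ProtocolPart(len(parts), kind, header or "", "\n".join(body))
            )

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if detect_titles and line.startswith("# "):
            flush()
            parts.append(ProtocolPart(len(parts), PART_TITLE, line[2:], ""))
            header, body = None, []
        elif line.endswith(":"):
            flush()
            header, body = line[:-1], []
        else:
            body.append(line)
    flush()
    return parts


def create_protocol_parts(meeting):
    """Point 2: choose the parsing variant by committee type."""
    is_plenum = meeting.committee.type == PLENUM
    meeting.parts = _parse(meeting.protocol_text, detect_titles=is_plenum)
    return meeting.parts
```

Because the parts are rebuilt from protocol_text alone, create_protocol_parts can be called again after a parser fix without contacting the Knesset website.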
