d:dZddlZddlZddlZddlmZmZmZddlm Z ddl m Z ddl m Z mZmZddlmZdd lmZdd lmZdd lmZdd lmZdd lmZddlmZGddZy)a Module containing the UniversalDetector detector class, which is the primary class a user of ``chardet`` should use. :author: Mark Pilgrim (initial port to Python) :author: Shy Shalom (original C code) :author: Dan Blanchard (major refactoring for 3.0) :author: Ian Cordasco N)ListOptionalUnion)CharSetGroupProber) CharSetProber) InputStateLanguageFilter ProbingState)EscCharSetProber) Latin1Prober)MacRomanProber)MBCSGroupProber) ResultDict)SBCSGroupProber) UTF1632Proberc NeZdZdZdZej dZej dZej dZ dddd d d d d dZ dddd ddddZ e jdfde deddfdZedefdZedefdZedeefdZd!dZdeeefddfdZdefd Zy)"UniversalDetectoraq The ``UniversalDetector`` class underlies the ``chardet.detect`` function and coordinates all of the different charset probers. To get a ``dict`` containing an encoding and its confidence, you can simply run: .. code:: u = UniversalDetector() u.feed(some_bytes) u.close() detected = u.result g?s[-]s(|~{)s[-]z Windows-1252z Windows-1250z Windows-1251z Windows-1256z Windows-1253z Windows-1255z Windows-1254z Windows-1257) iso-8859-1z iso-8859-2z iso-8859-5z iso-8859-6z iso-8859-7z iso-8859-8 iso-8859-9z iso-8859-13z ISO-8859-11GB18030CP949UTF-16)asciirztis-620rgb2312zeuc-krzutf-16leF lang_filtershould_rename_legacyreturnNcd|_d|_g|_dddd|_d|_d|_t j|_d|_ ||_ tjt|_d|_||_|j#y)Nencoding confidencelanguageF)_esc_charset_prober_utf1632_prober_charset_probersresultdone _got_datar PURE_ASCII _input_state _last_charrlogging getLogger__name__logger_has_win_bytesrreset)selfrrs ;/usr/lib/python3/dist-packages/chardet/universaldetector.py__init__zUniversalDetector.__init__ds @D 8<57#   &11&''1 #$8! r%c|jSN)r-r5s r6 input_statezUniversalDetector.input_state{s   r%c|jSr9)r3r:s r6 has_win_byteszUniversalDetector.has_win_bytess"""r%c|jSr9)r(r:s r6charset_probersz!UniversalDetector.charset_proberss$$$r%cVdddd|_d|_d|_d|_tj |_d|_|jr|jj|jr|jj|jD]}|jy)z Reset the UniversalDetector and all of its probers back to their initial states. This is called by ``__init__``, so you only need to call this directly in between analyses of different documents. Nr r!Fr%) r)r*r+r3r r,r-r.r&r4r'r()r5probers r6r4zUniversalDetector.resets $(sM  #&11  # #  $ $ * * ,     & & (++ F LLN r%byte_strcV |jry|syt|ts t|}|js|j t j r dddd|_n|j t jt jfr dddd|_nt|j dr dddd|_nW|j d r d ddd|_n:|j t jt jfr d ddd|_d |_|jd d |_y|jtjk(r|jj!|rtj"|_ nZ|jtjk(r=|j$j!|j&|zrtj(|_ |dd|_|j*st-|_|j*j.t0j2k(rk|j*j5|t0j6k(r?|j*j8|j*j;dd|_d |_y|jtj(k(r|j<st?|j@|_|j<j5|t0j6k(rS|j<j8|j<j;|j<jBd|_d |_yy|jtj"k(r:|jDstG|j@g|_"|j@tHjJzr#|jDjMtO|jDjMtQ|jDjMtS|jDD]Z}|j5|t0j6k(s&|j8|j;|jBd|_d |_n|jTj!|rd |_+yyy)a Takes a chunk of a document and feeds it through all of the relevant charset probers. After calling ``feed``, you can check the value of the ``done`` attribute to see if you need to continue feeding the ``UniversalDetector`` more data, or if it has made a prediction (in the ``result`` attribute). .. note:: You should always call ``close`` when you're done feeding in your document if ``done`` is not already ``True``. Nz UTF-8-SIG?r!zUTF-32szX-ISO-10646-UCS-4-3412szX-ISO-10646-UCS-4-2143rTr"),r* isinstance bytearrayr+ startswithcodecsBOM_UTF8r) BOM_UTF32_LE BOM_UTF32_BEBOM_LEBOM_BEr-r r,HIGH_BYTE_DETECTORsearch HIGH_BYTE ESC_DETECTORr. ESC_ASCIIr'rstater DETECTINGfeedFOUND_IT charset_nameget_confidencer&r rr$r(rr NON_CJKappendrr rWIN_BYTE_DETECTORr3)r5rBrAs r6rWzUniversalDetector.feeds 99  (I. *H~~""6??3!,"% " $$f&9&96;N;N%OP,43TVW $$%89!9"% "  $$%89!9"% "  $$fmmV]]%CD,43TVW !DN{{:&2      5 5 5&&--h7$.$8$8!!!Z%:%::%%,,T__x-GH$.$8$8!"23-###0?D    % %)?)? ?##((2l6K6KK $ 4 4 A A"&"6"6"E"E"G " !     4 4 4+++;D// ;;x(L,A,AA$*$7$7&,&;&;&=$*OO#DK !%DI %%,,X6&*#7#7r%c H|jr |jSd|_|js|jj dnD|j t jk(r dddd|_n|j t jk(rd}d}d}|jD]}|s|j}||kDs|}|}!|r||jkDr|j}|J|j}|j}|jd r(|jr|j j#||}|j$r.|j&j#|xsdj|}|||j(d|_|jj+t,j.kr|jd |jj d |jD]}|st1|t2rR|j4D]B}|jj d |j|j(|jDh|jj d |j|j(|j|jS) z Stop analyzing the current document and come up with a final prediction. :returns: The ``result`` attribute, a ``dict`` with the keys `encoding`, `confidence`, and `language`. Tzno data received!rrDrEr!Nr ziso-8859r"z no probers hit minimum thresholdz%s %s confidence = %s)r*r)r+r2debugr-r r,rRr(rZMINIMUM_THRESHOLDrYlowerrIr3 ISO_WIN_MAPgetr LEGACY_MAPr$getEffectiveLevelr/DEBUGrGrprobers) r5prober_confidencemax_prober_confidence max_proberrArYlower_charset_namer# group_probers r6closezUniversalDetector.closesp 99;;  ~~ KK  1 2  *"7"7 7'.crRDK  *"6"6 6 $ $' !J// ($*$9$9$;!$'<<,=)!'J  (4t7M7MM)66 #///%1%7%7%9"'668 &00<**'+'7'7';';. ( ,,#'??#6#6%+224l$L!-", * 3 3  ;; ( ( *gmm ;{{:&. !!"DE$($9$9L' !,0BC&2&:&:F KK-- 7 & 3 3 & & 5 5 7  ))3(55(11(779 ${{r%)rN)r1 __module__ __qualname____doc__r`recompilerPrSr]rbrdr ALLboolr7propertyintr;r=rrr?r4rbytesrHrWrrmr%r6rr8s7 #N32::l+L" >2$$$$$$$% K $ $J'5&8&8%*##  .!S!!#t##%m!4%%&A+U5)#34A+A+FMzMr%r)rprJr/rqtypingrrrcharsetgroupproberr charsetproberrenumsr r r escproberr latin1proberr macromanproberrmbcsgroupproberr resultdictrsbcsgroupproberr utf1632proberrrrxr%r6rsH8 ((2(;;'&*,",(rrr%