
"""Tokenization classes for CANINE."""

from typing import Dict, List, Optional

from ...tokenization_utils import AddedToken, PreTrainedTokenizer
from ...utils import logging


logger = logging.get_logger(__name__)

# Unicode defines 1,114,112 total "codepoints" (one per possible character),
# so a character-level model can use the code point itself as the token id.
UNICODE_VOCAB_SIZE = 1114112

# Canonical codepoints for CANINE's special, pseudo-characters. Apart from PAD
# (the NUL codepoint), these values fall in Unicode's Private Use Area, so they
# will never be assigned to real characters by the Unicode Consortium and are
# safe to reserve here.
PAD = 0
CLS = 0xE000
SEP = 0xE001
BOS = 0xE002
MASK = 0xE003
RESERVED = 0xE004

# Maps special codepoints to human-readable names.
SPECIAL_CODEPOINTS: Dict[int, str] = {
    CLS: "[CLS]",
    SEP: "[SEP]",
    BOS: "[BOS]",
    MASK: "[MASK]",
    PAD: "[PAD]",
    RESERVED: "[RESERVED]",
}

# Maps special codepoint human-readable names to their codepoint values.
SPECIAL_CODEPOINTS_BY_NAME: Dict[str, int] = {name: codepoint for codepoint, name in SPECIAL_CODEPOINTS.items()}


class CanineTokenizer(PreTrainedTokenizer):
    r"""
    Construct a CANINE tokenizer (i.e. a character splitter). It turns text into a sequence of characters, and then
    converts each character into its Unicode code point.

    [`CanineTokenizer`] inherits from [`PreTrainedTokenizer`].

    Refer to superclass [`PreTrainedTokenizer`] for usage examples and documentation concerning parameters.

    Args:
        model_max_length (`int`, *optional*, defaults to 2048):
                The maximum sentence length the model accepts.
    """

    def __init__(
        self,
        bos_token=chr(CLS),
        eos_token=chr(SEP),
        sep_token=chr(SEP),
        cls_token=chr(CLS),
        pad_token=chr(PAD),
        mask_token=chr(MASK),
        add_prefix_space=False,
        model_max_length=2048,
        **kwargs,
    ):
        bos_token = AddedToken(bos_token, lstrip=False, rstrip=False) if isinstance(bos_token, str) else bos_token
        eos_token = AddedToken(eos_token, lstrip=False, rstrip=False) if isinstance(eos_token, str) else eos_token
        sep_token = AddedToken(sep_token, lstrip=False, rstrip=False) if isinstance(sep_token, str) else sep_token
        cls_token = AddedToken(cls_token, lstrip=False, rstrip=False) if isinstance(cls_token, str) else cls_token
        pad_token = AddedToken(pad_token, lstrip=False, rstrip=False) if isinstance(pad_token, str) else pad_token

        # The mask token behaves like a normal word, i.e. it includes the space before it.
        mask_token = AddedToken(mask_token, lstrip=True, rstrip=False) if isinstance(mask_token, str) else mask_token

        # Creates a mapping for looking up the IDs of special symbols.
        self._special_codepoints: Dict[str, int] = {}
        for codepoint, name in SPECIAL_CODEPOINTS.items():
            self._special_codepoints[name] = codepoint

        # Creates a mapping for looking up the string forms of special symbol IDs.
        self._special_codepoint_strings: Dict[int, str] = {
            codepoint: name for name, codepoint in self._special_codepoints.items()
        }

        self._unicode_vocab_size = UNICODE_VOCAB_SIZE
        self._num_special_tokens = len(self._special_codepoints)

        super().__init__(
            bos_token=bos_token,
            eos_token=eos_token,
            sep_token=sep_token,
            cls_token=cls_token,
            pad_token=pad_token,
            mask_token=mask_token,
            add_prefix_space=add_prefix_space,
            model_max_length=model_max_length,
            **kwargs,
        )

    @property
    def vocab_size(self) -> int:
        return self._unicode_vocab_size

    def get_vocab(self):
        vocab = {chr(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab

    def _tokenize(self, text: str) -> List[str]:
        """Tokenize a string (i.e. perform character splitting)."""
        return list(text)

    def _convert_token_to_id(self, token: str) -> int:
        """Converts a token (i.e. a Unicode character) into an id (i.e. its integer Unicode code point value)."""
        try:
            return ord(token)
        except TypeError:
            raise ValueError(f"invalid token: '{token}'")

    def _convert_id_to_token(self, index: int) -> str:
        """
        Converts a Unicode code point (integer) into a token (str). In case it's a special code point, convert to
        human-readable format.
        """
        try:
            if index in SPECIAL_CODEPOINTS:
                return SPECIAL_CODEPOINTS[index]
            return chr(index)
        except TypeError:
            raise ValueError(f"invalid id: {index}")

    def convert_tokens_to_string(self, tokens):
        return "".join(tokens)

    def build_inputs_with_special_tokens(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and
        adding special tokens. A CANINE sequence has the following format:

        - single sequence: `[CLS] X [SEP]`
        - pair of sequences: `[CLS] A [SEP] B [SEP]`

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]

        result = cls + token_ids_0 + sep
        if token_ids_1 is not None:
            result += token_ids_1 + sep
        return result

    def get_special_tokens_mask(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False
    ) -> List[int]:
        """
        Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding
        special tokens using the tokenizer `prepare_for_model` method.

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.
            already_has_special_tokens (`bool`, *optional*, defaults to `False`):
                Whether or not the token list is already formatted with special tokens for the model.

        Returns:
            `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token.
        """
        if already_has_special_tokens:
            return super().get_special_tokens_mask(
                token_ids_0=token_ids_0, token_ids_1=token_ids_1, already_has_special_tokens=True
            )

        result = [1] + ([0] * len(token_ids_0)) + [1]
        if token_ids_1 is not None:
            result += ([0] * len(token_ids_1)) + [1]
        return result

    def create_token_type_ids_from_sequences(
        self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Create a mask from the two sequences passed to be used in a sequence-pair classification task. A CANINE
        sequence pair mask has the following format:

        ```
        0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
        | first sequence    | second sequence |
        ```

        If `token_ids_1` is `None`, this method only returns the first portion of the mask (0s).

        Args:
            token_ids_0 (`List[int]`):
                List of IDs.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).
        """
        sep = [self.sep_token_id]
        cls = [self.cls_token_id]

        result = len(cls + token_ids_0 + sep) * [0]
        if token_ids_1 is not None:
            result += len(token_ids_1 + sep) * [1]
        return result

    # CANINE has no vocabulary file to write out, so there is nothing to save.
    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None):
        return ()
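

# Illustrative usage sketch (an editorial addition, not part of the upstream
# module): because CANINE tokenizes at the character level and needs no vocab
# file, the tokenizer can be constructed directly with its defaults. Each
# character becomes its Unicode code point, wrapped in the CLS/SEP
# pseudo-characters (0xE000 = 57344 and 0xE001 = 57345). Since this module
# uses relative imports, run it as
# `python -m transformers.models.canine.tokenization_canine`.
if __name__ == "__main__":
    tokenizer = CanineTokenizer()

    ids = tokenizer.encode("hello")
    print(ids)  # [57344, 104, 101, 108, 108, 111, 57345]
    print(tokenizer.decode(ids, skip_special_tokens=True))  # hello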