"""Processor class for Mllama."""

from typing import List, Optional, Union

import numpy as np

from ...feature_extraction_utils import BatchFeature
from ...image_utils import ImageInput
from ...processing_utils import ImagesKwargs, ProcessingKwargs, ProcessorMixin, Unpack
from ...tokenization_utils_base import PreTokenizedInput, TextInput
from .image_processing_mllama import make_list_of_images


class MllamaImagesKwargs(ImagesKwargs, total=False):
    max_image_tiles: Optional[int]


class MllamaProcessorKwargs(ProcessingKwargs, total=False):
    images_kwargs: MllamaImagesKwargs

    _defaults = {
        "images_kwargs": {
            "max_image_tiles": 4,
        },
    }


def get_cross_attention_token_mask(input_ids: List[int], image_token_id: int) -> List[List[int]]:
    """
    Generate a cross-attention token mask for image tokens in the input sequence.

    This function identifies the positions of image tokens in the input sequence and creates
    a mask that defines which subsequent tokens each image token should attend to.

    Args:
        input_ids (List[int]): A list of token ids representing the input sequence.
        image_token_id (int): The id of the token used to represent images in the sequence.

    Returns:
        List[List[int]]: A list of [start, end] pairs, where each pair represents the range
        of tokens an image token should attend to.

    Notes:
        - If no image tokens are present, an empty list is returned.
        - For a single image token, it attends to all subsequent tokens until the end of the sequence.
        - For multiple image tokens, each attends to tokens up to the next image token or the end of the sequence.
        - Consecutive image tokens are treated as a group and attend to all subsequent tokens together.
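
    Example:

        >>> # Illustrative values: 99 stands in for the model's real image token id.
        >>> get_cross_attention_token_mask([99, 1, 2, 99, 3], image_token_id=99)
        [[0, 3], [3, 5]]
        >>> # Consecutive image tokens attend to all subsequent tokens as a group.
        >>> get_cross_attention_token_mask([99, 99, 1, 2], image_token_id=99)
        [[0, 4], [1, 4]]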
    """
    image_token_locations = [i for i, token in enumerate(input_ids) if token == image_token_id]

    if len(image_token_locations) == 0:
        return []

    # only one image present, unmask until end of sequence
    if len(image_token_locations) == 1:
        return [[image_token_locations[0], -1]]

    vision_masks = [[loc1, loc2] for loc1, loc2 in zip(image_token_locations[:-1], image_token_locations[1:])]

    # last image will attend to all subsequent text
    vision_masks.append([image_token_locations[-1], len(input_ids)])

    # if there are two or more consecutive vision tokens,
    # they should all attend to all subsequent text together
    last_mask_end = vision_masks[-1][1]
    for vision_mask in vision_masks[::-1]:
        if vision_mask[0] == vision_mask[1] - 1:
            vision_mask[1] = last_mask_end
        last_mask_end = vision_mask[1]

    return vision_masks


def convert_sparse_cross_attention_mask_to_dense(
    cross_attention_token_mask: List[List[List[int]]],
    num_tiles: List[List[int]],
    max_num_tiles: int,
    length: int,
) -> np.ndarray:
    """
    Convert the cross attention mask indices to a cross attention mask 4D array.

    This function takes a sparse representation of cross attention masks and converts it to a dense 4D numpy array.
    The sparse representation is a nested list structure that defines attention ranges for each image in each batch item.

    Args:
        cross_attention_token_mask (List[List[List[int]]]): A nested list structure where:
            - The outer list represents the batch dimension.
            - The middle list represents different images within each batch item.
            - The inner list contains pairs of integers [start, end] representing token ranges for each image.
        num_tiles (List[List[int]]): A nested list structure specifying the number of tiles for each image in each batch item.
        max_num_tiles (int): The maximum possible number of tiles.
        length (int): The total sequence length of the input.

    Returns:
        np.ndarray: A 4D numpy array of shape (batch_size, length, max_num_images, max_num_tiles)
            The array contains `1` where attention is allowed and `0` where it is not.

    Note:
        - Special handling is done for cases where the end token is -1, which is interpreted as attending to the end of the sequence.
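
    Example:

        >>> # Illustrative values: one sample with a single image attending to token
        >>> # positions [0, 2), using 1 of at most 2 tiles.
        >>> mask = convert_sparse_cross_attention_mask_to_dense(
        ...     [[[0, 2]]], num_tiles=[[1]], max_num_tiles=2, length=3
        ... )
        >>> mask.shape
        (1, 3, 1, 2)
        >>> mask[0, :, 0, :].tolist()
        [[1, 0], [1, 0], [0, 0]]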
    """

    batch_size = len(cross_attention_token_mask)
    max_num_images = max([len(masks) for masks in cross_attention_token_mask])

    cross_attention_mask = np.zeros(
        shape=(batch_size, length, max_num_images, max_num_tiles),
        dtype=np.int64,
    )

    for sample_idx, (sample_masks, sample_num_tiles) in enumerate(zip(cross_attention_token_mask, num_tiles)):
        for mask_idx, (locations, mask_num_tiles) in enumerate(zip(sample_masks, sample_num_tiles)):
            if len(locations) == 2:
                start, end = locations
                end = min(end, length)
                if end == -1:
                    end = length
                cross_attention_mask[sample_idx, start:end, mask_idx, :mask_num_tiles] = 1
    return cross_attention_mask


def build_string_from_input(prompt: str, bos_token: str, image_token: str) -> str:
    """
    Builds a string from the input prompt by adding `bos_token` if not already present.

    Args:
        prompt (`str`):
            The input prompt string.
        bos_token (`str`):
            The beginning of sentence token to be added.
        image_token (`str`):
            The image token used to identify the start of an image sequence.

    Returns:
        str: The modified prompt string with the `bos_token` added if necessary.

    Examples:
        >>> build_string_from_input("Hello world", "<begin_of_text>", "<|image|>")
        '<begin_of_text>Hello world'

        >>> build_string_from_input("<|image|>Hello world", "<begin_of_text>", "<|image|>")
        '<|image|><begin_of_text>Hello world'
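
        >>> # Several leading image tokens are all kept ahead of the inserted bos_token.
        >>> build_string_from_input("<|image|><|image|>Hello world", "<begin_of_text>", "<|image|>")
        '<|image|><|image|><begin_of_text>Hello world'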

        >>> build_string_from_input("<begin_of_text>Hello world", "<begin_of_text>", "<|image|>")
        '<begin_of_text>Hello world'
    """

    if bos_token in prompt:
        return prompt

    num_image_tokens_on_start = 0
    while prompt.startswith(image_token):
        prompt = prompt[len(image_token) :]
        num_image_tokens_on_start += 1

    return f"{image_token * num_image_tokens_on_start}{bos_token}{prompt}"


class MllamaProcessor(ProcessorMixin):
    r"""
    Constructs a Mllama processor which wraps [`MllamaImageProcessor`] and
    [`PreTrainedTokenizerFast`] into a single processor that inherits both the image processor and
    tokenizer functionalities. See the [`~MllamaProcessor.__call__`] and [`~MllamaProcessor.decode`] for more
    information.
    The preferred way of passing kwargs is as a dictionary per modality, see usage example below.
        ```python
        from transformers import MllamaProcessor
        from PIL import Image

        processor = MllamaProcessor.from_pretrained("meta-llama/Llama-3.2-11B-Vision")

        processor(
            images=your_pil_image,
            text=["<|image|>If I had to write a haiku for this one"],
            images_kwargs = {"size": {"height": 448, "width": 448}},
            text_kwargs = {"padding": "right"},
            common_kwargs = {"return_tensors": "pt"},
        )
        ```

    Args:
        image_processor ([`MllamaImageProcessor`]):
            The image processor is a required input.
        tokenizer ([`PreTrainedTokenizer`, `PreTrainedTokenizerFast`]):
            The tokenizer is a required input.
    """

    attributes = ["image_processor", "tokenizer"]
    image_processor_class = "MllamaImageProcessor"
    tokenizer_class = "PreTrainedTokenizerFast"

    def __init__(self, image_processor, tokenizer):
        self.image_token = "<|image|>"
        self.image_token_id = tokenizer.convert_tokens_to_ids(self.image_token)
        self.python_token = "<|python_tag|>"
        self.python_token_id = tokenizer.convert_tokens_to_ids(self.python_token)
        self.bos_token = tokenizer.bos_token
        self.chat_template = tokenizer.chat_template
        super().__init__(image_processor, tokenizer)

    def __call__(
        self,
        images: Optional[ImageInput] = None,
        text: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
        audio=None,
        videos=None,
        **kwargs: Unpack[MllamaProcessorKwargs],
    ) -> BatchFeature:
        """
        Main method to prepare text(s) and image(s) to be fed as input to the model. This method forwards the `text`
        arguments to PreTrainedTokenizerFast's [`~PreTrainedTokenizerFast.__call__`] if `text` is not `None` to encode
        the text. To prepare the image(s), this method forwards the `images` arguments to
        MllamaImageProcessor's [`~MllamaImageProcessor.__call__`] if `images` is not `None`. Please refer
        to the docstring of the above two methods for more information.

        Args:
            images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
                The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
                tensor. Both channels-first and channels-last formats are supported.
            text (`str`, `List[str]`, `List[List[str]]`):
                The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
                (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
            return_tensors (`str` or [`~utils.TensorType`], *optional*):
                If set, will return tensors of a particular framework. Acceptable values are:
                    - `'tf'`: Return TensorFlow `tf.constant` objects.
                    - `'pt'`: Return PyTorch `torch.Tensor` objects.
                    - `'np'`: Return NumPy `np.ndarray` objects.
                    - `'jax'`: Return JAX `jnp.ndarray` objects.
        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
            TODO: add aspect_ratio_ids and aspect_ratio_mask and cross_attention_mask
        """
        if text is None and images is None:
            raise ValueError("You must specify either text or images.")

        output_kwargs = self._merge_kwargs(
            MllamaProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
            **kwargs,
        )

        text_kwargs = output_kwargs["text_kwargs"]
        images_kwargs = output_kwargs["images_kwargs"]
        common_kwargs = output_kwargs["common_kwargs"]

        data = {}
        if text is not None:
            if isinstance(text, str):
                text = [text]
            elif not (isinstance(text, (list, tuple)) and all(isinstance(t, str) for t in text)):
                raise ValueError("Invalid input text. Please provide a string, or a list of strings")
            n_images_in_text = [t.count(self.image_token) for t in text]
            text = [build_string_from_input(text_item, self.bos_token, self.image_token) for text_item in text]
            _ = text_kwargs.pop("padding_side", None)  # hack until padding_side is an accepted tokenizer kwarg
            encoding = self.tokenizer(text, **text_kwargs)
            data.update(encoding)

        n_images_in_images = [0]
        if images is not None:
            images = make_list_of_images(images)
            n_images_in_images = [len(sample) for sample in images]

        if text is not None:
            if any(batch_img == 0 for batch_img in n_images_in_text) and not all(
                batch_img == 0 for batch_img in n_images_in_text
            ):
                raise ValueError(
                    "If a batch of text is provided, there should be either no images or at least one image per sample"
                )
            if sum(n_images_in_images) != sum(n_images_in_text):
                if images is None:
                    raise ValueError("No images were provided, but there are image tokens in the prompt")
                else:
                    raise ValueError(
                        f"The number of image tokens ({sum(n_images_in_text)}) should match the number of "
                        f"provided images ({sum(n_images_in_images)})"
                    )

        if images is not None:
            image_features = self.image_processor(images, **images_kwargs)
            num_tiles = image_features.pop("num_tiles")
            data.update(image_features)

        # create the cross attention mask relating text tokens to image tiles
        if images is not None and text is not None:
            cross_attention_token_mask = [
                get_cross_attention_token_mask(token_ids, self.image_token_id) for token_ids in encoding["input_ids"]
            ]
            cross_attention_mask = convert_sparse_cross_attention_mask_to_dense(
                cross_attention_token_mask,
                num_tiles=num_tiles,
                max_num_tiles=self.image_processor.max_image_tiles,
                length=max(len(input_ids) for input_ids in encoding["input_ids"]),
            )
            data["cross_attention_mask"] = cross_attention_mask

        return_tensors = common_kwargs.pop("return_tensors", None)
        batch_feature = BatchFeature(data=data, tensor_type=return_tensors)

        return batch_feature

    def batch_decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to PreTrainedTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
        refer to the docstring of this method for more information.
        """
        return self.tokenizer.batch_decode(*args, **kwargs)

    def decode(self, *args, **kwargs):
        """
        This method forwards all its arguments to PreTrainedTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
        the docstring of this method for more information.
        """
        return self.tokenizer.decode(*args, **kwargs)

    @property
    def model_input_names(self):
        tokenizer_input_names = self.tokenizer.model_input_names
        image_processor_input_names = self.image_processor.model_input_names
        return list(tokenizer_input_names + image_processor_input_names + ["cross_attention_mask"])