support gemma4 use_hybrid_sliding#129
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a hybrid attention mode for Gemma4 in multimodal scenarios, allowing full attention layers to fallback to unfused attention with arbitrary masks while sliding layers utilize the configured backend with native sliding windows. The feedback recommends safely accessing the attention backend name using getattr to prevent potential AttributeErrors. Additionally, it is suggested to replace the mm_token_type_ids is None check with a more robust has_vision flag to correctly handle text-only batches that contain zero-valued token type tensors.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| # In hybrid mode (backend != unfused), only full attention layers need arbitrary mask; | ||
| # sliding layers keep causal mask and use native flash sliding window. | ||
| # When backend is unfused, all layers use arbitrary mask. | ||
| if config.attention_backend.name == 'unfused' or not self.is_sliding: |
There was a problem hiding this comment.
To prevent potential AttributeError if config.attention_backend is None or does not have a name attribute (e.g., in certain testing or custom configurations), use getattr to safely retrieve the backend name.
| if config.attention_backend.name == 'unfused' or not self.is_sliding: | |
| if getattr(config.attention_backend, 'name', None) == 'unfused' or not self.is_sliding: |
| 'errors in multimodal scenarios. Please always pass pure text data.') | ||
| text_config.use_bidirectional_attention = None | ||
| else: | ||
| if config.attention_backend.name == 'unfused': |
There was a problem hiding this comment.
| self.hybrid_attention_mode = ( | ||
| text_config.use_bidirectional_attention == 'vision' and config.attention_backend.name != 'unfused') |
There was a problem hiding this comment.
| sliding_attention = sliding_attention | window_mask | ||
| # In hybrid mode, only skip sliding mask when no vision tokens (pure text batch). | ||
| # When vision tokens are present, fall back to all-unfused behavior for correctness. | ||
| use_hybrid_sliding = self.hybrid_attention_mode and mm_token_type_ids is None |
There was a problem hiding this comment.
If mm_token_type_ids is not None but contains no vision tokens (e.g., a tensor of all zeros, which is common for text-only batches in multimodal pipelines), checking mm_token_type_ids is None will incorrectly evaluate to False. This disables hybrid sliding and falls back to the slower manual sliding window mask. We should check if there are actually any vision tokens present in the batch.
has_vision = mm_token_type_ids is not None and (mm_token_type_ids > 0).any()\n use_hybrid_sliding = self.hybrid_attention_mode and not has_vision| attention_mask['sliding_attention'] = sliding_attention | ||
| attention_mask['full_attention'] = full_attention | ||
| # Signal sliding layers to override attn_mask_type to arbitrary in hybrid mode with vision | ||
| attention_mask['_sliding_override_arbitrary'] = self.hybrid_attention_mode and mm_token_type_ids is not None |
There was a problem hiding this comment.
Use the has_vision boolean flag defined earlier to avoid overriding the attention mask to arbitrary when there are no actual vision tokens in the batch.
| attention_mask['_sliding_override_arbitrary'] = self.hybrid_attention_mode and mm_token_type_ids is not None | |
| attention_mask['_sliding_override_arbitrary'] = self.hybrid_attention_mode and has_vision |
No description provided.