{"meta":{"title":"SciCat","subtitle":"Ad astra per aspera","description":"","author":"Spectrum Cat","url":"http://example.com","root":"/"},"pages":[],"posts":[{"title":"【每日阅读】AIRS","slug":"【每日阅读】AIRS","date":"2025-03-08T05:40:19.000Z","updated":"2025-03-08T13:46:21.904Z","comments":false,"path":"2025/03/08/【每日阅读】AIRS/","permalink":"http://example.com/2025/03/08/%E3%80%90%E6%AF%8F%E6%97%A5%E9%98%85%E8%AF%BB%E3%80%91AIRS/","excerpt":"","text":"1. Abstract &amp; Info 1.1 Abstract Existing state-of-the-art dense object detection techniques tend to produce a large number of false positive detections on difficult images with complex scenes because they focus on ensuring a high recall. To improve the detection accuracy, we propose an Adaptive Important Region Selection (AIRS) framework guided by Evidential Q-learning coupled with a uniquely designed reward function. Inspired by human visual attention, our detection model conducts object search in a top-down, hierarchical fashion. It starts from the top of the hierarchy with the coarsest granularity and then identifies the potential patches likely to contain objects of interest. It then discards non-informative patches and progressively moves downward on the selected ones for a fine-grained search. The proposed evidential Q-learning systematically encodes epistemic uncertainty in its evidential-Q value to encourage the exploration of unknown patches, especially in the early phase of model training. In this way, the proposed model dynamically balances exploration-exploitation to cover both highly valuable and informative patches. Theoretical analysis and extensive experiments on multiple datasets demonstrate that our proposed framework outperforms the SOTA models. 1.2 Info Authors: Dingrong Wang, Hitesh Sapkota, Qi Yu DOI: Publication: Advances in Neural Information Processing Systems PDF: PDF Zotero: [PDF] Date: 2024-12-16 2. 
Annotation %% begin annotations %% Imported: 2025-03-08 9:37 PM Original: [a large number of false positive detections on difficult images with complex scenes because they focus on ensuring a high recall.] Note: Many false positives (FP) appear in complex scenes. Original: [Evidential Q-learning coupled with a uniquely designed reward function] Note: Addressed with a reinforcement learning approach. Original: [top-down, hierarchical fashion] Note: Top-down, hierarchical search. Original: [from the top of the hierarchy with the coarsest granularity and then identifies the potential patches likely to contain objects of interest.] Note: Progressive coarse-to-fine refinement. Original: [the proposed model dynamically balances exploration-exploitation to cover both highly valuable and informative patches] Note: The model dynamically balances exploration and exploitation. Original: [In testing, some negative anchors may generate an unusually high-quality estimation score and be selected as positive anchors (i.e., false positives) due to lack of supervision.] Note: At test time, some negative anchors may produce unusually high quality-estimation scores and, for lack of supervision, be selected as positive anchors (i.e., false positives). Original: [on small objects in a complex background] Note: Both complex backgrounds and small objects lead to more FPs. Original: [in existing one-stage detectors does not capture diverse types of candidate anchors residing on the Feature Pyramid Network (FPN)] Note: The problem addressed lies in one-stage detectors (do two-stage detectors face a similar issue?): they fail to capture sufficiently diverse candidate anchors, which causes many false positives. Original: [Evidential Q-learning] Note: The main method. Original: [Furthermore, in the early phase of RL agent training, AIRS also encourages the agent to explore highly uncertain patches by leveraging the epistemic uncertainty provided by our evidential Q-value. 
Exploration of novel patches is also dynamically balanced with the exploitation of predicted high quality region.] Note: Early in training, the RL agent is encouraged to explore highly uncertain regions to strengthen its discovery ability. Original: [an adaptive hierarchical object detection paradigm supported by an RL agent to mimic human visual attention that performs searching in the top-down fashion,] Note: An adaptive hierarchical object detection method driven by an RL agent. Original: [novel evidential Q-learning driven by a unique reward function, covering both potentially positive and highly uncertain patches through dynamically balancing exploitation and exploration,] Note: A new Q-learning method. Original: [theoretical guarantee on the fast convergence of the proposed evidential Q-learning algorithm,] Note: A theoretical guarantee for the Q-learning algorithm. Original: [Deep Reinforcement Learning for Object Detection.] Note: The most recent method cited is from 2021 and the others are from 2014, 2015, and 2016, so this direction has received relatively little attention. Original: [Once a patch is selected by the RL agent, it is passed through the feature extractor followed by the recurrent neural network (RNN) to generate the state representation.] Note: Step 1: once the RL agent selects a patch, it is passed through the feature extractor and then the RNN to obtain the state representation. Original: [The state representation then goes through the evidential Q-network, which formulates an evidential Normal-inverse Gaussian (NIG) distribution and outputs the Q-value estimate for each available action.] Note: Step 2: the state representation is passed to the evidential Q-network, which produces an NIG distribution and Q-value estimates. Original: [Then by combining with the corresponding epistemic uncertainty in the Q-value estimate, we have the evidential Q-value which balances the estimated Q-value with the (lack of) knowledge of the RL agent on the chosen action.] Note: Step 3: combining the Q-value estimate with its epistemic uncertainty yields the evidential Q-value for the action (selecting a patch). Original: [Based on the evidential Q-value obtained by balancing epistemic uncertainty and estimated Q-value, the agent takes action, and then selects the next patch with the goal of maximizing the expected reward.] Note: Step 4: based on the evidential Q-value, the agent takes an action and selects the next patch to maximize the expected reward. Original: [] Note: Original: [] Note: Original: [] Note: Specifically, each selected patch is combined with the previous state s(t) and fed into the RNN to produce the next state s(t+1). Original: [Let D denote the size of the action space.] Note: The action space has size D. Original: [] Note: The evidential Q-learning procedure; to be added later. Original: [The action interaction module translates the action into the location of the next patch to be 
selected.] Note: The action interaction module translates the action into the location of the next patch to select, hi. Original: [] Note: In the D-dimensional action vector, the first D-1 entries are downward actions and the last one is the upward action. Original: [For downward movement actions, we compute the reward based on the ranking of the patch selected by the movement action compared to all other patches located on the same layer in terms of the number of the positive anchors they contain.] Note: For downward actions, the selected patch is ranked against the other patches on the same layer by the number of positive anchors each contains, and the reward is based on that ranking. Original: [Specifically, we compute the quality measure estimate of each anchor by investigating a range of metrics: centerness [38], IoU[32], GIoU[34], and DIoU [46].] Note: Quality metrics: centerness, IoU, GIoU, DIoU. Original: [In addition, we set up a penalty term with the downward movement in each time step representing the searching cost.] Note: In addition, a penalty term is applied to each downward move at every time step to represent the search cost. Original: [] Note: The reward formula, combining the matching score and the time cost. Original: [For upward movement, the reward is simply set to 0,] Note: Upward moves receive no reward, since moving up always has a known outcome. %% end annotations %% %% Import Date: 2025-03-08T21:37:49.617+08:00 %%","categories":[],"tags":[{"name":"文献阅读","slug":"文献阅读","permalink":"http://example.com/tags/%E6%96%87%E7%8C%AE%E9%98%85%E8%AF%BB/"},{"name":"强化学习","slug":"强化学习","permalink":"http://example.com/tags/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0/"},{"name":"目标检测","slug":"目标检测","permalink":"http://example.com/tags/%E7%9B%AE%E6%A0%87%E6%A3%80%E6%B5%8B/"}]},{"title":"【每日阅读】CP-DETR","slug":"【每日阅读】CP-DETR","date":"2025-03-05T18:25:39.000Z","updated":"2025-03-08T13:47:48.056Z","comments":false,"path":"2025/03/06/【每日阅读】CP-DETR/","permalink":"http://example.com/2025/03/06/%E3%80%90%E6%AF%8F%E6%97%A5%E9%98%85%E8%AF%BB%E3%80%91CP-DETR/","excerpt":"","text":"1. Abstract &amp; Info 1.1 Abstract Recent research on universal object detection aims to introduce language in a SoTA closed-set detector and then generalize the open-set concepts by constructing large-scale (text-region) datasets for training. 
However, these methods face two main challenges: (i) how to efficiently use the prior information in the prompts to genericise objects and (ii) how to reduce alignment bias in the downstream tasks, both leading to sub-optimal performance in some scenarios beyond pre-training. To address these challenges, we propose a strong universal detection foundation model called CP-DETR, which is competitive in almost all scenarios, with only one pre-training weight. Specifically, we design an efficient prompt visual hybrid encoder that enhances the information interaction between prompt and visual through scale-by-scale and multi-scale fusion modules. Then, the hybrid encoder is facilitated to fully utilize the prompted information by prompt multi-label loss and auxiliary detection head. In addition to text prompts, we have designed two practical concept prompt generation methods, visual prompt and optimized prompt, to extract abstract concepts through concrete visual examples and stably reduce alignment bias in downstream tasks. With these effective designs, CP-DETR demonstrates superior universal detection performance in a broad spectrum of scenarios. For example, our Swin-T backbone model achieves 47.6 zero-shot AP on LVIS, and the Swin-L backbone model achieves 32.2 zero-shot AP on ODinW35. Furthermore, our visual prompt generation method achieves 68.4 AP on COCO val by interactive detection, and the optimized prompt achieves 73.1 fully-shot AP on ODinW13. 1.2 Info Authors: Qibo Chen, Weizhong Jin, Jianyue Ge, Mengdi Liu, Yuchao Yan, Jian Jiang, Li Yu, Xuanjiang Guo, Shuchang Li, Jianzhong Chen DOI: 10.48550&#x2F;arXiv.2412.09799 Publication: PDF: Preprint PDF Zotero: [Preprint PDF] Date: 2024-12-13 2. Annotation %% begin annotations %% Imported: 2025-03-06 10:22 AM Original: [introduce language in a SoTA closed-set detector and then generalize the open-set concepts by constructing large-scale (text-region) datasets for training. 
H] Note: Original: [how to reduce alignment bias in the downstream tasks, both leading to sub-optimal performance in some scenarios beyond pre-training] Note: When data is out-of-distribution (OOD), the same task can also suffer from bias. Original: [we design an efficient prompt visual hybrid encoder that enhances the information interaction between prompt and visual through scale-by-scale and multi-scale fusion modules.] Note: A hybrid encoder strengthens the fusion between prompt and visual features. Original: [fully utilize the prompted information by prompt multi-label loss] Note: Multi-label loss over the prompts. Original: [auxiliary detection head.] Note: Multi-task. Original: [text prompts, we have designed two practical concept prompt generation methods, visual prompt and optimized prompt,] Note: Multiple prompt types: text prompts, visual prompts, and optimized prompts. Original: [While using text prompts has been primarily favored in universal detection, they suffer from sub-optimal performance in downstream applications,] Note: In downstream applications, text prompts perform sub-optimally and fall short of models optimized for the specific domain. Original: [matching deficiency, where the detector produces mismatched results with the text description] Note: Matching deficiency in earlier universal object detection (UOD) methods, caused by the detector producing results that mismatch the text description. Original: [Objectively, text descriptions follow a long-tailed pattern and different descriptions can refer to the same image region, so it is impractical to align all the texts and image regions accurately during pretraining.] Note: Objectively, text descriptions are long-tailed and multiple descriptions can refer to the same region, so accurately aligning all texts with all image regions during pre-training is impractical. Original: [Subjectively, it is difficult for users to accurately describe complex objects, such as specific mechanical devices, through language.] Note: Subjectively, users find it hard to describe some objects precisely. Original: [The work (Li et al. 2022b) has shown that the early fusion paradigm performs significantly better than the late fusion paradigm after eliminating alignment bias through prompt tuning in the downstream tasks.] Note: Early fusion outperforms late fusion, provided that alignment bias is first eliminated by prompt tuning on the downstream task. Original: [Late fusion paradigms (Li et al. 
2019) only use prompt vectors in the classification part, the location dependent on pre-training data distributions, which is poor in utilizing prompt information.] Note: In late-fusion object detection, only classification benefits from the prompts; localization still depends on pre-training, so prompt information is under-used. Original: [we believe that a key to improving the performance of universal detection lies in achieving effective cross-modal interaction between prompt and visual.] Note: Effective interaction between prompt and visual information is the key to universal detection. Original: [] Note: CP-DETR uses three prompt types: text, visual, and optimized. Oddly, adding visual and optimized prompts is roughly equivalent to adding supervision, which differs from true zero-shot; I will look further below for how this differs from supervised learning. Original: [CP-DETR, a model based on the early fusion paradigm that not only supports text prompts but also introduces visual prompts and optimized prompts to address alignment biases beyond pre-training] Note: CP-DETR supports not only text prompts but also visual prompts and optimized prompts. Original: [Visual prompts avoid misalignment arising from subjective user description errors] Note: Visual prompts sidestep the lack of a precise description. Original: [An optimized prompt provides a more direct solution by prompt tuning through downstream data annotation to align regions without changing the pre-training weights] Note: Optimized prompts use downstream annotations to align regions. Original: [concept prompts to represent these vectors in a unified way and divide the whole model into two parts: detector and concept prompt generation.] Note: Two parts: the detector and the concept prompt generator. Original: [For effective cross-modal interaction, we design an efficient prompt visual hybrid encoder that updates visual and concept prompts via progressive single-scale fusion (PSF) and multi-scale fusion gating (MFG), avoiding confusion due to semantic gaps between different levels of visual features.] Note: The hybrid encoder. Original: [Due to DETR being a sparse detector framework] Note: Why does a sparse detection framework need an auxiliary detection head? Original: [we added an auxiliary detection head and a prompt multi-label loss to facilitate the hybrid encoder to fully utilize different modal information in the interaction.] Note: An auxiliary detection head and a prompt multi-label loss are added so that the hybrid encoder fully exploits the different modalities during interaction. Original: [For visual prompt, we design a visual prompt encoder that encodes the bbox as a query and adaptively aggregates concept representations from multi-scale features output by the visual backbone.] Note: visual 
prompt is obtained from backbone features given a bbox. Is this reasonable? Where does the bbox come from? Original: [Text Prompted Universal Detection] Note: Some useful methods worth citing. Original: [] Note: I could not find how the box coordinates here are provided. If a box falls on background, it should intuitively hurt detection; if it falls on foreground, what is the point of detection? Imported: 2025-03-06 10:22 AM Original: [] Note: %% end annotations %% %% Import Date: 2025-03-06T10:22:56.837+08:00 %%","categories":[],"tags":[{"name":"文献阅读","slug":"文献阅读","permalink":"http://example.com/tags/%E6%96%87%E7%8C%AE%E9%98%85%E8%AF%BB/"},{"name":"LVIS minival","slug":"LVIS-minival","permalink":"http://example.com/tags/LVIS-minival/"},{"name":"通用目标检测","slug":"通用目标检测","permalink":"http://example.com/tags/%E9%80%9A%E7%94%A8%E7%9B%AE%E6%A0%87%E6%A3%80%E6%B5%8B/"}]},{"title":"【Benchmark】LVIS minival","slug":"【Benchmark】LVIS minival 开放词汇检测","date":"2025-02-21T00:33:25.000Z","updated":"2025-03-08T13:52:48.551Z","comments":false,"path":"2025/02/21/【Benchmark】LVIS minival 开放词汇检测/","permalink":"http://example.com/2025/02/21/%E3%80%90Benchmark%E3%80%91LVIS%20minival%20%E5%BC%80%E6%94%BE%E8%AF%8D%E6%B1%87%E6%A3%80%E6%B5%8B/","excerpt":"","text":"Open-vocabulary detection. Columns: Method | Pre-trained on LVIS | Text source | Year | mAP for rare. Grounding DINO-L | No | BERT | 2023 | 33.9; DetCLIP-L (no code) | No | GPT-4 | 2024 | 45.1; Yolo-world-L | No | CLIP | 2024 | 35.4; CP-DETR-Pro (no code) | Yes | CLIP | 2024 | 47.6 (not rare); DITO | Yes | CLIP | 2024 | 40.5; LaMI-DETR | Yes | GPT-3.5 | 2024 | 43.4","categories":[],"tags":[{"name":"Benchmark","slug":"Benchmark","permalink":"http://example.com/tags/Benchmark/"},{"name":"LVIS minival","slug":"LVIS-minival","permalink":"http://example.com/tags/LVIS-minival/"}]},{"title":"【每日阅读】YOLO-World","slug":"【每日阅读】Yolo-World","date":"2025-02-21T00:25:14.000Z","updated":"2025-03-08T13:49:20.566Z","comments":false,"path":"2025/02/21/【每日阅读】Yolo-World/","permalink":"http://example.com/2025/02/21/%E3%80%90%E6%AF%8F%E6%97%A5%E9%98%85%E8%AF%BB%E3%80%91Yolo-World/","excerpt":"","text":"1. Abstract &amp; Info 1.1 Abstract The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. 
However, their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation, we introduce YOLO-World, an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically, we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset, YOLO-World achieves 35.4 AP with 52.0 FPS on V100, which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore, the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks, including object detection and open-vocabulary instance segmentation. 1.2 Info Authors: Tianheng Cheng, Lin Song, Yixiao Ge, Wenyu Liu, Xinggang Wang, Ying Shan DOI: 10.48550&#x2F;arXiv.2401.17270 Publication: Date: 2024-02-22 2. Annotation %% begin annotations %% Imported: 2025-02-21 4:24 PM Original: [Re-parameterizable Vision-Language Path Aggregation Network] Note: The core component. Original: [for high-efficiency open-vocabulary object detection] Note: The emphasis is on efficiency. Original: [YOLO-World follows the standard YOLO architecture [20] and leverages the pre-trained CLIP [39] text encoder to encode the input texts.] Note: CLIP supplies the text information. Original: [During inference, the text encoder can be removed and the text embeddings can be re-parameterized into weights of RepVL-PAN for efficient deployment.] Note: Text embeddings are re-parameterized into the RepVL-PAN weights, so inference no longer depends on the CLIP text encoder. Original: [the prompt-then-detect paradigm (Fig. 
2 (c)) first encodes the prompts of a user to build an offline vocabulary and the vocabulary varies with different needs.] Note: The user's prompts are first encoded into an offline vocabulary, and the vocabulary can differ from one use case to another. Original: [which consists of a YOLO detector, a Text Encoder, and a Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN).] Note: The three components of YOLO-World. Original: [YOLO Detector. YOLO-World is mainly developed based on YOLOv8 [20], which contains a Darknet backbone [20, 43] as the image encoder, a path aggregation network (PAN) for multi-scale feature pyramids, and a head for bounding box regression and object embeddings.] Note: Composition of the YOLO detector. Original: [Given the text T, we adopt the Transformer text encoder pre-trained by CLIP [39] to extract the corresponding text embeddings W &#x3D; TextEncoder(T) ∈ R^{C×D},] Note: Original: [we adopt the decoupled head with two 3×3 convs to regress bounding boxes {b_k}_{k&#x3D;1}^{K} and object embeddings {e_k}_{k&#x3D;1}^{K},] Note: Two decoupled branches, handling classification and box regression respectively. Original: [We present a text contrastive head to obtain the object-text similarity s_{k,j} by] Note: Done via Eq. 1. Original: [During training, we construct an online vocabulary T for each mosaic sample containing 4 images] Note: Training uses mosaic samples. Original: [the user can define a series of custom prompts, which might include captions or categories.] Note: For offline inference, predefined prompts are embedded in advance. Original: [we propose the Text-guided CSPLayer (T-CSPLayer) and Image-Pooling Attention (I-Pooling Attention)] Note: Components of RepVL-PAN. Original: [we adopt the max-sigmoid attention after the last dark bottleneck block to aggregate text features into image features by:] Note: Eq. 2 shows how text information is fused into the visual features. Original: [To enhance the text embeddings with image-aware information, we aggregate image features to update the text embeddings by proposing the Image-Pooling Attention. Rather than directly using cross-attention on image features,] Note: Eq. 3 shows how visual information is fused into the text embeddings. Original: [LVIS minival] Note: LVIS is used as a transfer benchmark, evaluated on the minival split. Original: [Table 3. Ablations on Pre-training Data. 
We evaluate the zero-shot performance on LVIS of pre-training YOLO-World with different amounts of data.] Note: Results improve after adding CC3M, but it requires extra data processing. Original: [Table 5. Text Encoder in YOLO-World. We ablate different text encoders in YOLO-World through the zero-shot LVIS evaluation] Note: With CLIP, keeping the text encoder frozen works better. %% end annotations %% %% Import Date: 2025-02-21T16:24:37.343+08:00 %%","categories":[],"tags":[{"name":"文献阅读","slug":"文献阅读","permalink":"http://example.com/tags/%E6%96%87%E7%8C%AE%E9%98%85%E8%AF%BB/"},{"name":"开放词汇目标检测","slug":"开放词汇目标检测","permalink":"http://example.com/tags/%E5%BC%80%E6%94%BE%E8%AF%8D%E6%B1%87%E7%9B%AE%E6%A0%87%E6%A3%80%E6%B5%8B/"}]},{"title":"【每日阅读】Grounding DINO","slug":"【每日阅读】Grounding DINO","date":"2025-02-20T19:17:11.000Z","updated":"2025-03-08T13:49:26.053Z","comments":false,"path":"2025/02/21/【每日阅读】Grounding DINO/","permalink":"http://example.com/2025/02/21/%E3%80%90%E6%AF%8F%E6%97%A5%E9%98%85%E8%AF%BB%E3%80%91Grounding%20DINO/","excerpt":"","text":"1. Abstract &amp; Info 1.1 Abstract In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO&#x2F;+&#x2F;g. 
Grounding DINO achieves a $52.5$ AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean $26.1$ AP. Code will be available at \\url{https://github.com/IDEA-Research/GroundingDINO}. 1.2 Info Authors: Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang DOI: 10.48550&#x2F;ARXIV.2303.05499 Publication: Date: 2023-01-01 2. Annotation %% begin annotations %% Imported: 2025-02-17 1:02 PM Original: [Feature fusion can be performed in three phases: neck (phase A), query initialization (phase B), and head (phase C). For example, GLIP [25] performs early fusion in the neck module (phase A), and OV-DETR [55] uses language-aware queries as head inputs (phase B). We argue that introducing more feature fusion into the pipeline can facilitate better alignment between different modality features, thereby achieving better performance.] Note: The paper argues that language information can be fused at three phases, and the more phases it is fused into, the better. Original: [Although conceptually simple, it is hard for previous work to perform feature fusion in all three phases.] Note: Fusing at all three phases at once is difficult. Original: [Unlike classical detectors, the Transformer-based detector method such as DINO has a consistent structure with language blocks.] Note: Compared with CNN-based methods, DETR-style methods are a better fit because their structure resembles that of language models. Original: [Most existing open-set models [14, 21] rely on pre-trained CLIP models for concept generalization] Note: First approach: concept generalization based on CLIP. Original: [GLIP [25] presents a different way by reformulating object detection as a phrase grounding task and introducing contrastive training between object regions and language phrases on large-scale data.] Note: Second approach: treat object detection as a phrase grounding task and link object regions with language phrases. Original: [GLIP’s approach involves concatenating all categories into a sentence in a random order. 
However, the direct category names concatenation does not consider the potential influence of unrelated categories on each other when extracting features.] Note: The drawback of the GLIP approach. Original: [To mitigate this issue and improve model performance during grounded training, we introduce a technique that utilizes sub-sentence level text features.] Note: Sub-sentence-level text features. Original: [] Note: Three settings: closed-set detection, open-set detection, and referring (prompted) object detection. Original: [] Note: Original: [I_{N_q} &#x3D; Top_{N_q}(Max^{(-1)}(X_I X_T^⊺)). (1)] Note: The image features selected as queries. Original: [Each cross-modality query is fed into a self-attention layer, an image cross-attention layer to combine image features, a text cross-attention layer to combine text features, and an FFN layer in each cross-modality decoder layer.] Note: The cross-modality decoder processes the queries and injects text information. Original: [Table 2: Zero-shot domain transfer and fine-tuning on COCO. *] Note: Zero-shot COCO results. Original: [It eliminates the influence between different category names while keeping per-word features for fine-grained understanding.] Note: Removes unwanted interference between category names. Original: [] Note: Closed-set: COCO. Original: [] Note: Open-set: zero-shot COCO, LVIS, ODinW. Original: [] Note: Referring (prompted) detection: RefCOCO&#x2F;+&#x2F;g. Original: [Using BERT as our text encoder,] Note: Text encoder: BERT. Original: [Table 3: Model results on LVIS.] Note: LVIS results. Original: [Table 4: Model results on the ODinW benchmark.] Note: ODinW results. %% end annotations %% %% Import Date: 2025-02-17T13:19:36.595+08:00 %%","categories":[],"tags":[{"name":"文献阅读","slug":"文献阅读","permalink":"http://example.com/tags/%E6%96%87%E7%8C%AE%E9%98%85%E8%AF%BB/"},{"name":"开放词汇目标检测","slug":"开放词汇目标检测","permalink":"http://example.com/tags/%E5%BC%80%E6%94%BE%E8%AF%8D%E6%B1%87%E7%9B%AE%E6%A0%87%E6%A3%80%E6%B5%8B/"}]}],"categories":[],"tags":[{"name":"文献阅读","slug":"文献阅读","permalink":"http://example.com/tags/%E6%96%87%E7%8C%AE%E9%98%85%E8%AF%BB/"},{"name":"强化学习","slug":"强化学习","permalink":"http://example.com/tags/%E5%BC%BA%E5%8C%96%E5%AD%A6%E4%B9%A0/"},{"name":"目标检测","slug":"目标检测","permalink":"http://example.com/tags/%E7%9B%AE%E6%A0%87%E6%A3%80%E6%B5%8B/"},{"name":"LVIS 
minival","slug":"LVIS-minival","permalink":"http://example.com/tags/LVIS-minival/"},{"name":"通用目标检测","slug":"通用目标检测","permalink":"http://example.com/tags/%E9%80%9A%E7%94%A8%E7%9B%AE%E6%A0%87%E6%A3%80%E6%B5%8B/"},{"name":"Benchmark","slug":"Benchmark","permalink":"http://example.com/tags/Benchmark/"},{"name":"开放词汇目标检测","slug":"开放词汇目标检测","permalink":"http://example.com/tags/%E5%BC%80%E6%94%BE%E8%AF%8D%E6%B1%87%E7%9B%AE%E6%A0%87%E6%A3%80%E6%B5%8B/"}]}