The evaluation of mathematical reasoning capabilities is essential for advancing Artificial General Intelligence (AGI). While Large Language Models (LLMs) have shown impressive performance in solving mathematical problems, existing benchmarks such as GSM8K and MATH present limitations, including narrow problem definitions with specific numbers and reliance on predetermined rules that hinder accurate assessments of reasoning and adaptability. This paper introduces the UTMath Benchmark, which robustly evaluates models through extensive unit tests. It consists of 1,053 problems across 9 mathematical domains, with over 68 test cases per problem. We propose an innovative evaluation framework inspired by unit testing in software development, focusing on both the accuracy and the reliability of results. Furthermore, we introduce the Reasoning-to-Coding of Thoughts (RCoT) approach, which encourages LLMs to perform explicit reasoning before generating code, leading to more advanced solutions and improved performance. Finally, we release not only the UTMath benchmark but also the UTMath-Train training dataset (more than 70k samples) to support the community in further exploring mathematical reasoning.
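To make the unit-test protocol concrete, here is a minimal sketch of what such an evaluation loop might look like. The `solve` entry point, the test-case format, and the all-or-nothing pass criterion are assumptions for illustration, not the benchmark's actual harness.

```python
# Minimal sketch of unit-test-style evaluation: a generated solution is
# accepted only if it passes every test case (hypothetical schema).

def evaluate_solution(solution_code: str, test_cases: list) -> bool:
    namespace = {}
    try:
        exec(solution_code, namespace)   # assumed to define a `solve` function
        solve = namespace["solve"]
        return all(solve(*args) == expected for args, expected in test_cases)
    except Exception:
        return False

# Example: a sequence-style problem checked against 68 cases.
candidate = "def solve(n):\n    return n * (n + 1) // 2"  # triangular numbers
tests = [((k,), k * (k + 1) // 2) for k in range(1, 69)]
print(evaluate_solution(candidate, tests))  # True
```

Scoring many independent cases per problem is what distinguishes this style of evaluation from single-answer benchmarks: a memorized constant cannot pass 68 distinct inputs.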
https://arxiv.org/abs/2411.07240
OpenThaiGPT 1.5 is an advanced Thai language chat model based on Qwen v2.5, finetuned on over 2,000,000 Thai instruction pairs. This report provides an engineering perspective on the model's development, capabilities, and performance. We discuss the model's architecture, training process, and key features, including multi-turn conversation support, Retrieval Augmented Generation (RAG) compatibility, and tool-calling functionality. Benchmark results demonstrate OpenThaiGPT 1.5's state-of-the-art performance on various Thai language tasks, outperforming other open-source Thai language models. We also address practical considerations such as GPU memory requirements and deployment strategies.
https://arxiv.org/abs/2411.07238
Language model users often issue queries that lack specification, where the context under which a query was issued -- such as the user's identity, the query's intent, and the criteria for a response to be useful -- is not explicit. For instance, a good response to a subjective query like "What book should I read next?" would depend on the user's preferences, and a good response to an open-ended query like "How do antibiotics work against bacteria?" would depend on the user's expertise. This makes evaluation of responses to such queries an ill-posed task, as evaluators may make arbitrary judgments about the response quality. To remedy this, we present contextualized evaluations, a protocol that synthetically constructs context surrounding an underspecified query and provides it during evaluation. We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping win rates between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts. Specifically, our procedure uncovers an implicit bias towards WEIRD contexts in models' "default" responses and we find that models are not equally sensitive to following different contexts, even when they are provided in prompts.
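A minimal sketch of what the protocol could look like in practice: an underspecified query is paired with synthetic context before a judge compares two responses. The field names and prompt wording are illustrative assumptions, not the paper's exact templates.

```python
# Sketch of contextualized evaluation: attach synthetic context to an
# underspecified query before asking a judge to compare two responses.

def build_judge_prompt(query, context, resp_a, resp_b):
    context_block = "\n".join(f"- {k}: {v}" for k, v in context.items())
    return (
        f"Query: {query}\n"
        f"Context for this query:\n{context_block}\n\n"
        f"Response A: {resp_a}\nResponse B: {resp_b}\n"
        "Given the context, which response better serves this user? Answer A or B."
    )

print(build_judge_prompt(
    "What book should I read next?",
    {"user": "high-school student", "intent": "light summer reading"},
    "Try Gravity's Rainbow ...",
    "Try The Martian ...",
))
```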
https://arxiv.org/abs/2411.07237
Adding objects to images based on text instructions is a challenging task in semantic image editing, requiring a balance between preserving the original scene and seamlessly integrating the new object in a fitting location. Despite extensive efforts, existing models often struggle with this balance, particularly with finding a natural location for adding an object in complex scenes. We introduce Add-it, a training-free approach that extends diffusion models' attention mechanisms to incorporate information from three key sources: the scene image, the text prompt, and the generated image itself. Our weighted extended-attention mechanism maintains structural consistency and fine details while ensuring natural object placement. Without task-specific fine-tuning, Add-it achieves state-of-the-art results on both real and generated image insertion benchmarks, including our newly constructed "Additing Affordance Benchmark" for evaluating object placement plausibility, outperforming supervised methods. Human evaluations show that Add-it is preferred in over 80% of cases, and it also demonstrates improvements in various automated metrics.
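The following is a toy sketch of what a weighted extended-attention step could look like: queries from the generated image attend over keys and values gathered from all three sources, with per-source weights applied before the softmax. The shapes, the choice of weights, and the use of keys as values are simplifying assumptions, not the paper's exact mechanism.

```python
import torch
import torch.nn.functional as F

# Toy weighted extended attention over three token sources.
def extended_attention(q, kv_scene, kv_text, kv_self, weights=(1.0, 1.0, 1.0)):
    keys = torch.cat([kv_scene, kv_text, kv_self], dim=1)      # (B, S, D)
    logits = q @ keys.transpose(1, 2) / q.shape[-1] ** 0.5     # (B, T, S)
    sizes = [kv_scene.shape[1], kv_text.shape[1], kv_self.shape[1]]
    w = torch.cat([torch.full((s,), wi) for s, wi in zip(sizes, weights)])
    attn = F.softmax(logits + w.log(), dim=-1)                 # per-source weighting
    return attn @ keys                                         # values = keys here

B, D = 1, 32
out = extended_attention(
    torch.randn(B, 16, D),      # generated-image queries
    torch.randn(B, 16, D),      # scene-image tokens
    torch.randn(B, 8, D),       # text-prompt tokens
    torch.randn(B, 16, D),      # generated-image tokens
    weights=(1.2, 1.0, 0.8),    # e.g., emphasize scene structure
)
print(out.shape)                # torch.Size([1, 16, 32])
```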
https://arxiv.org/abs/2411.07232
Image watermarking methods are not tailored to handle small watermarked areas. This restricts applications in real-world scenarios where parts of the image may come from different sources or have been edited. We introduce a deep-learning model for localized image watermarking, dubbed the Watermark Anything Model (WAM). The WAM embedder imperceptibly modifies the input image, while the extractor segments the received image into watermarked and non-watermarked areas and recovers one or several hidden messages from the areas found to be watermarked. The models are jointly trained at low resolution and without perceptual constraints, then post-trained for imperceptibility and multiple watermarks. Experiments show that WAM is competitive with state-of-the-art methods in terms of imperceptibility and robustness, especially against inpainting and splicing, even on high-resolution images. Moreover, it offers new capabilities: WAM can locate watermarked areas in spliced images and extract distinct 32-bit messages with less than 1 bit error from multiple small regions - no larger than 10% of the image surface - even for small $256\times 256$ images.
https://arxiv.org/abs/2411.07231
The datasets used for Deep Neural Network training (e.g., ImageNet, MSCOCO, etc.) are often manually balanced across categories (classes) to facilitate learning of all the categories. This curation process is often expensive and requires throwing away precious annotated data to balance the frequency across classes. This is because the distribution of data in the world (e.g., the internet) differs significantly from well-curated datasets and is often over-populated with samples from common categories. Algorithms designed for well-curated datasets perform suboptimally when used to learn from imperfect datasets with long-tailed imbalances and distribution shifts. For deep models to be widely used, it is necessary to do away with the costly curation process by developing robust algorithms that can learn from real-world data distributions. Toward this goal, we develop practical algorithms for Deep Neural Networks that can learn from the limited and imperfect data present in the real world. These works are divided into four segments, each covering a scenario of learning from limited or imperfect data. The first part focuses on Learning Generative Models for Long-Tail Data, where we mitigate mode collapse for tail (minority) classes and enable image generations as diverse and aesthetic as those of head (majority) classes. In the second part, we enable effective generalization on tail classes through Inductive Regularization schemes, which allow tail classes to generalize as well as the head classes without enforcing explicit generation of images. In the third part, we develop algorithms for Optimizing Relevant Metrics, rather than average accuracy, when learning from long-tailed data with limited annotation (semi-supervised). The fourth part focuses on effective domain adaptation of the model to various domains with zero to very few labeled samples.
https://arxiv.org/abs/2411.07229
To enhance large language models (LLMs) for chemistry problem solving, several LLM-based agents augmented with tools have been proposed, such as ChemCrow and Coscientist. However, their evaluations are narrow in scope, leaving a large gap in understanding the benefits of tools across diverse chemistry tasks. To bridge this gap, we develop ChemAgent, an enhanced chemistry agent over ChemCrow, and conduct a comprehensive evaluation of its performance on both specialized chemistry tasks and general chemistry questions. Surprisingly, ChemAgent does not consistently outperform its base LLMs without tools. Our error analysis with a chemistry expert suggests that: For specialized chemistry tasks, such as synthesis prediction, we should augment agents with specialized tools; however, for general chemistry questions like those in exams, agents' ability to reason correctly with chemistry knowledge matters more, and tool augmentation does not always help.
https://arxiv.org/abs/2411.07228
With the widespread adoption of digital environments, reliable authentication and continuous access control have become crucial. They can minimize cyber attacks and prevent fraud, especially fraud associated with identity theft. A particular interest lies in keystroke dynamics (KD), which refers to the task of recognizing individuals' identity based on their unique typing style. In this work, we propose the use of pre-trained language models (PLMs) to recognize such patterns. Although PLMs have shown high performance on multiple NLP benchmarks, using these models on specific tasks requires customization. BERT and RoBERTa, for instance, rely on subword tokenization and cannot be directly applied to KD, which requires temporal-character information to recognize users. Recent character-aware PLMs are able to process both subword- and character-level information and can be an alternative solution. Notwithstanding, they are still not suitable to be directly fine-tuned for KD, as they are not optimized to account for users' temporal typing information (e.g., hold time and flight time). To overcome this limitation, we propose TempCharBERT, an architecture that incorporates temporal-character information in the embedding layer of CharBERT. This allows modeling keystroke dynamics for the purpose of user identification and authentication. Our results show a significant improvement with this customization. We also show the feasibility of training TempCharBERT in a federated learning setting in order to foster data privacy.
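A minimal sketch of the architectural idea, under stated assumptions: each character embedding is fused (here, by summation) with a projection of that keystroke's timing features. The dimensions, the fusion rule, and the feature set are illustrative, not TempCharBERT's exact design.

```python
import torch
import torch.nn as nn

# Sketch: fuse character embeddings with per-keystroke timing features
# (hold time, flight time) in the embedding layer.
class TemporalCharEmbedding(nn.Module):
    def __init__(self, vocab_size=128, dim=256, n_timing=2):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, dim)
        self.timing_proj = nn.Linear(n_timing, dim)  # (hold, flight) -> dim

    def forward(self, char_ids, timings):
        # char_ids: (batch, seq); timings: (batch, seq, 2) in seconds
        return self.char_emb(char_ids) + self.timing_proj(timings)

emb = TemporalCharEmbedding()
chars = torch.randint(0, 128, (1, 10))
times = torch.rand(1, 10, 2)
print(emb(chars, times).shape)  # torch.Size([1, 10, 256])
```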
https://arxiv.org/abs/2411.07224
Large video models, pretrained on massive amounts of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks. However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to reach the visual states depicted in a video. To tackle this problem, current methods use a separate vision-based inverse dynamics model trained on embodiment-specific data to map image states to actions. Gathering data to train such a model is often expensive and challenging, and this model is limited to visual settings similar to the ones in which data are available. In this paper, we investigate how to directly ground video models to continuous actions through self-exploration in the embodied environment -- using generated video states as visual goals for exploration. We propose a framework that uses trajectory-level action generation in combination with video guidance to enable an agent to solve complex tasks without any external supervision, e.g., rewards, action labels, or segmentation masks. We validate the proposed approach on 8 tasks in Libero, 6 tasks in MetaWorld, 4 tasks in Calvin, and 12 tasks in iThor Visual Navigation. We show that our approach is on par with or even surpasses multiple behavior cloning baselines trained on expert demonstrations, without requiring any action annotations.
https://arxiv.org/abs/2411.07223
In this paper, we introduce TreeCoders, a novel family of transformer trees. We move away from traditional linear transformers to complete k-ary trees. Transformer blocks serve as nodes, and generic classifiers learn to select the best child and route the sequence of tokens to a specific leaf. The selectors, moved outside the transformer blocks, allow for the use of a variety of architectures without further modifications. Furthermore, our proposed architecture supports sparse node activation thanks to the logarithmic complexity of a tree search. We validate our idea by testing a series of decoder-only tree transformers, achieving competitive results across a diverse range of language datasets. Our study demonstrates that the proposed tree transformer model outperforms a size-equivalent linear transformer model 76\% of the time over a wide range of tree architectures. Furthermore, our proposed model naturally lends itself to distributed implementation.
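A toy sketch of the routing idea follows: transformer blocks sit at the nodes of a complete k-ary tree, and a selector outside each block picks one child per level, so a forward pass touches only a logarithmic number of nodes. The sizes, the greedy argmax routing, and the mean-pooled selector input are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Toy tree of transformer blocks with external selectors routing the sequence.
class TreeNode(nn.Module):
    def __init__(self, dim, k, depth):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.children_ = nn.ModuleList(
            TreeNode(dim, k, depth - 1) for _ in range(k)
        ) if depth > 0 else None
        self.selector = nn.Linear(dim, k) if depth > 0 else None

    def forward(self, x):
        x = self.block(x)
        if self.children_ is None:
            return x                                  # reached a leaf
        choice = self.selector(x.mean(dim=1)).argmax(dim=-1)
        return self.children_[int(choice[0])](x)      # route (batch size 1)

tree = TreeNode(dim=64, k=2, depth=3)      # 15 nodes; only 4 active per pass
print(tree(torch.randn(1, 16, 64)).shape)  # torch.Size([1, 16, 64])
```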
https://arxiv.org/abs/2411.07218
With the recent exhibited strength of generative diffusion models, an open research question is \textit{if images generated by these models can be used to learn better visual representations}. While this generative data expansion may suffice for easier visual tasks, we explore its efficacy on a more difficult discriminative task: clothes-changing person re-identification (CC-ReID). CC-ReID aims to match people appearing in non-overlapping cameras, even when they change their clothes across cameras. Not only are current CC-ReID models constrained by the limited diversity of clothing in current CC-ReID datasets, but generating additional data that retains important personal features for accurate identification is also a current challenge. To address this issue we propose DLCR, a novel data expansion framework that leverages pre-trained diffusion and large language models (LLMs) to accurately generate diverse images of individuals in varied attire. We generate additional data for five benchmark CC-ReID datasets (PRCC, CCVID, LaST, VC-Clothes, and LTCC) and \textbf{increase their clothing diversity by \boldmath{$10$}x, totaling over \boldmath{$2.1$}M images generated}. DLCR employs diffusion-based text-guided inpainting, conditioned on clothing prompts constructed using LLMs, to generate synthetic data that only modifies a subject's clothes while preserving their personally identifiable features. With this massive increase in data, we introduce two novel strategies - progressive learning and test-time prediction refinement - that respectively reduce training time and further boost CC-ReID performance. On the PRCC dataset, we obtain a large top-1 accuracy improvement of $11.3\%$ by training CAL, a previous state-of-the-art (SOTA) method, with DLCR-generated data. We publicly release our code and generated data for each dataset here: \url{this https URL}.
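Under stated assumptions, the core generation step could be sketched with an off-the-shelf diffusion inpainting pipeline: the mask restricts edits to clothing pixels, so identifiable features outside the mask stay untouched. The checkpoint, file names, mask source, and prompt below are illustrative stand-ins, not the paper's exact setup.

```python
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# Sketch: text-guided inpainting confined to a clothing mask.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

person = Image.open("person.png").convert("RGB")            # ReID image
clothes_mask = Image.open("clothes_mask.png").convert("L")  # white = clothing

# Prompt of the kind an LLM could construct to diversify attire.
result = pipe(
    prompt="a person wearing a red plaid shirt and dark jeans",
    image=person,
    mask_image=clothes_mask,
).images[0]
result.save("person_new_outfit.png")
```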
https://arxiv.org/abs/2411.07205
This work investigates the reproducibility of the paper 'Explaining RL decisions with trajectories'. The original paper introduces a novel approach in explainable reinforcement learning based on attributing the decisions of an agent to specific clusters of trajectories encountered during training. We verify the main claims from the paper, which state that (i) training on fewer trajectories induces a lower initial state value, (ii) trajectories in a cluster present similar high-level patterns, (iii) distant trajectories influence the decision of an agent, and (iv) humans correctly identify the trajectories attributed to the decision of the agent. We recovered the environments used by the authors based on the partial original code they provided for one of the environments (Grid-World), and implemented the remaining ones from scratch (Seaquest, HalfCheetah, Breakout and Q*Bert). While we confirm that (i), (ii), and (iii) partially hold, we extend the largely qualitative experiments of the authors by introducing a quantitative metric to further support (iii), and new experiments and visual results for (i). Moreover, we investigate the use of different clustering algorithms and encoder architectures to further support (ii). We could not support (iv), given the limited extent of the original experiments. We conclude that, while some of the claims can be supported, further investigations and experiments could be of interest. We recognise the novelty of the work from the authors and hope that our work paves the way for clearer and more transparent approaches.
https://arxiv.org/abs/2411.07200
Instruction-guided image editing methods have demonstrated significant potential by training diffusion models on automatically synthesized or manually annotated image editing pairs. However, these methods remain far from practical, real-life applications. We identify three primary challenges contributing to this gap. Firstly, existing models have limited editing skills due to the biased synthesis process. Secondly, these methods are trained on datasets with a high volume of noise and artifacts, due to the application of simple filtering methods like CLIP-score. Thirdly, all these datasets are restricted to a single low resolution and fixed aspect ratio, limiting the versatility to handle real-world use cases. In this paper, we present \omniedit, an omnipotent editor that seamlessly handles seven different image editing tasks at any aspect ratio. Our contribution is four-fold: (1) \omniedit is trained by utilizing the supervision from seven different specialist models to ensure task coverage; (2) we utilize importance sampling based on the scores provided by large multimodal models (like GPT-4o), instead of CLIP-score, to improve the data quality; (3) we propose a new editing architecture called EditNet to greatly boost the editing success rate; (4) we provide images with different aspect ratios to ensure that our model can handle any image in the wild. We have curated a test set containing images of different aspect ratios, accompanied by diverse instructions to cover different tasks. Both automatic evaluation and human evaluations demonstrate that \omniedit can significantly outperform all the existing models. Our code, dataset and model will be available at \url{this https URL}
https://arxiv.org/abs/2411.07199
Advances in machine learning and the growing trend towards effortless data generation in real-world systems have led to increasing interest in data-inferred models and data-based control in robotics. It seems appealing to govern robots solely based on data, bypassing the traditional, more elaborate pipeline of system modeling through first principles and subsequent controller design. One promising data-driven approach is Extended Dynamic Mode Decomposition (EDMD) for control-affine systems, a system class which contains many vehicles and machines of immense practical importance including, e.g., typical wheeled mobile robots. EDMD can be highly data-efficient and computationally inexpensive, can deal with nonlinear dynamics as prevalent in robotics and mechanics, and has a sound theoretical foundation rooted in Koopman theory. Against this background, this paper examines how EDMD models can be integrated into predictive controllers for nonholonomic mobile robots. In addition to the conventional kinematic mobile robot, we also cover the complete data-driven control pipeline - from data acquisition to control design - when the robot is not treated in terms of first-order kinematics but in a second-order manner, allowing actuator dynamics to be accounted for. Using only real-world measurement data, it is shown in both simulations and hardware experiments that the surrogate models enable high-precision predictive controllers in the studied cases. However, the findings raise significant concerns about purely data-centric approaches that overlook the underlying geometry of nonholonomic systems, showing that, for nonholonomic systems, some geometric insight seems necessary and cannot be easily compensated for with large amounts of data.
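For readers unfamiliar with EDMD, the following numpy sketch shows its core computation: lift state snapshots with a dictionary of observables, then solve a least-squares problem for the approximate Koopman operator. The monomial dictionary and the toy dynamics are assumptions; the paper's observables and system differ.

```python
import numpy as np

def lift(x):
    """Dictionary of observables: [1, x1, x2, x1^2, x1*x2, x2^2]."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1**2, x1 * x2, x2**2])

# Snapshot pairs (x_k, x_{k+1}) from (here, synthetic) measurement data.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
Y = np.array([[0.9 * x1, 0.8 * x2 + 0.1 * x1**2] for x1, x2 in X])

Phi_X = np.array([lift(x) for x in X])             # lifted states
Phi_Y = np.array([lift(y) for y in Y])
K, *_ = np.linalg.lstsq(Phi_X, Phi_Y, rcond=None)  # Phi_X @ K approximates Phi_Y

# One-step prediction in lifted space; read off the linear coordinates.
x0 = np.array([0.5, -0.3])
print(lift(x0) @ K)  # entries 1 and 2 approximate the next state
```

For control-affine systems, the same regression is typically carried out with input-dependent observables, so the lifted surrogate stays linear in the control, which is what makes it convenient inside a predictive controller.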
https://arxiv.org/abs/2411.07192
Recent works have shown a surprising result: a small fraction of Large Language Model (LLM) parameter outliers are disproportionately important to the quality of the model. LLMs contain billions of parameters, so these small fractions, such as 0.01%, translate to hundreds of thousands of parameters. In this work, we present an even more surprising finding: Pruning as few as a single parameter can destroy an LLM's ability to generate text -- increasing perplexity by 3 orders of magnitude and reducing zero-shot accuracy to guessing. We propose a data-free method for identifying such parameters, termed super weights, using a single forward pass through the model. We additionally find that these super weights induce correspondingly rare and large activation outliers, termed super activations. When preserved with high precision, super activations can improve simple round-to-nearest quantization to become competitive with state-of-the-art methods. For weight quantization, we similarly find that by preserving the super weight and clipping other weight outliers, round-to-nearest quantization can scale to much larger block sizes than previously considered. To facilitate further research into super weights, we provide an index of super weight coordinates for common, openly available LLMs.
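A small sketch of the detection idea, under assumptions: forward hooks record where an unusually large activation appears during a single pass; the paper's method then traces such super activations back to the responsible weights. The toy model, the planted outlier, and the z-score threshold are illustrative only.

```python
import torch
import torch.nn as nn

def find_activation_outliers(model, x, z_thresh=6.0):
    outliers = []

    def hook(name):
        def fn(module, inputs, output):
            a = output.detach().flatten()
            z = (a - a.mean()) / (a.std() + 1e-8)
            if z.abs().max() > z_thresh:
                idx = int(z.abs().argmax())
                outliers.append((name, idx, float(a[idx])))
        return fn

    handles = [m.register_forward_hook(hook(n))
               for n, m in model.named_modules() if isinstance(m, nn.Linear)]
    with torch.no_grad():
        model(x)             # single forward pass, no labels or data needed
    for h in handles:
        h.remove()
    return outliers

model = nn.Sequential(nn.Linear(16, 256), nn.ReLU(), nn.Linear(256, 256))
with torch.no_grad():
    model[0].bias[3] = 100.0  # plant a "super weight" for illustration
print(find_activation_outliers(model, torch.randn(1, 16)))
```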
https://arxiv.org/abs/2411.07191
Large language models (LLMs) prompted with text and audio represent the state of the art in various auditory tasks, including speech, music, and general audio, showing emergent abilities on unseen tasks. However, these capabilities have yet to be fully demonstrated in bioacoustics tasks, such as detecting animal vocalizations in large recordings, classifying rare and endangered species, and labeling context and behavior - tasks that are crucial for conservation, biodiversity monitoring, and the study of animal behavior. In this work, we present NatureLM-audio, the first audio-language foundation model specifically designed for bioacoustics. Our carefully curated training dataset comprises text-audio pairs spanning a diverse range of bioacoustics, speech, and music data, designed to address the challenges posed by limited annotated datasets in the field. We demonstrate successful transfer of learned representations from music and speech to bioacoustics, and our model shows promising generalization to unseen taxa and tasks. Importantly, we test NatureLM-audio on a novel benchmark (BEANS-Zero) and it sets the new state of the art (SotA) on several bioacoustics tasks, including zero-shot classification of unseen species. To advance bioacoustics research, we also open-source the code for generating training and benchmark data, as well as for training the model.
https://arxiv.org/abs/2411.07186
Multi-source unsupervised domain adaptation aims to leverage labeled data from multiple source domains to train a machine learning model that generalizes well on a target domain without labels. Source domain selection plays a crucial role in determining the model's performance and relies on the similarities between source and target domains. Nonetheless, existing work on source domain selection often involves heavyweight computational procedures, especially when dealing with numerous source domains and the need to identify the best ones among them. In this paper, we introduce a framework for gradual fine-tuning (GFT) of machine learning models on multiple source domains. We represent the multiple source domains as an undirected weighted graph. We then give a new generalization error bound for GFT along any path within the graph, which is used to determine the optimal path corresponding to the optimal training order. With this formulation, we introduce three lightweight graph-routing strategies that tend to minimize the error bound. Our best strategy improves accuracy by $2.3\%$ over the state-of-the-art on the Natural Language Inference (NLI) task and achieves competitive performance on the Sentiment Analysis (SA) task, including a $3.9\%$ improvement on a more diverse subset of the data we use for SA.
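A minimal sketch of one possible lightweight routing strategy, under assumptions: with source domains as nodes and edge weights standing in for the domain distances that enter the error bound, fine-tune along the cheapest path ending at the domain closest to the target. The domains and weights below are made up for illustration.

```python
import heapq

def cheapest_path(graph, start, goal):
    """Dijkstra over an undirected weighted graph (adjacency dict)."""
    pq, seen = [(0.0, start, [start])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nbr, w in graph[node].items():
            if nbr not in seen:
                heapq.heappush(pq, (cost + w, nbr, path + [nbr]))
    return []

domains = {
    "news":    {"blogs": 0.3, "reviews": 0.7},
    "blogs":   {"news": 0.3, "reviews": 0.2, "forums": 0.8},
    "reviews": {"news": 0.7, "blogs": 0.2, "forums": 0.4},
    "forums":  {"blogs": 0.8, "reviews": 0.4},
}
# Fine-tune in this order, ending nearest the target domain.
print(cheapest_path(domains, "news", "forums"))
# ['news', 'blogs', 'reviews', 'forums']
```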
https://arxiv.org/abs/2411.07185
3D part segmentation is a crucial and challenging task in 3D perception, playing a vital role in applications such as robotics, 3D generation, and 3D editing. Recent methods harness the powerful Vision Language Models (VLMs) for 2D-to-3D knowledge distillation, achieving zero-shot 3D part segmentation. However, these methods are limited by their reliance on text prompts, which restricts the scalability to large-scale unlabeled datasets and the flexibility in handling part ambiguities. In this work, we introduce SAMPart3D, a scalable zero-shot 3D part segmentation framework that segments any 3D object into semantic parts at multiple granularities, without requiring predefined part label sets as text prompts. For scalability, we use text-agnostic vision foundation models to distill a 3D feature extraction backbone, allowing scaling to large unlabeled 3D datasets to learn rich 3D priors. For flexibility, we distill scale-conditioned part-aware 3D features for 3D part segmentation at multiple granularities. Once the segmented parts are obtained from the scale-conditioned part-aware 3D features, we use VLMs to assign semantic labels to each part based on the multi-view renderings. Compared to previous methods, our SAMPart3D can scale to the recent large-scale 3D object dataset Objaverse and handle complex, non-ordinary objects. Additionally, we contribute a new 3D part segmentation benchmark to address the lack of diversity and complexity of objects and parts in existing benchmarks. Experiments show that our SAMPart3D significantly outperforms existing zero-shot 3D part segmentation methods, and can facilitate various applications such as part-level editing and interactive segmentation.
https://arxiv.org/abs/2411.07184
Achieving robust legged locomotion on complex terrains poses challenges due to the high uncertainty in robot-environment interactions. Recent advances in bipedal and quadrupedal robots demonstrate good mobility on rugged terrains but rely heavily on sensors for stability due to low static stability from a high center of mass and a narrow base of support. We hypothesize that a multi-legged robotic system can leverage morphological redundancy from additional legs to minimize sensing requirements when traversing challenging terrains. Studies suggest that a multi-legged system with sufficient legs can reliably navigate noisy landscapes without sensing and control, albeit at a low speed of up to 0.1 body lengths per cycle (BLC). However, the control framework to enhance speed on challenging terrains remains underexplored due to the complex environmental interactions, making it difficult to identify the key parameters to control in these high-degree-of-freedom systems. Here, we present a bio-inspired vertical body undulation wave as a novel approach to mitigate environmental disturbances affecting robot speed, supported by experiments and probabilistic models. Finally, we introduce a control framework which monitors foot-ground contact patterns on rugose landscapes using binary foot-ground contact sensors to estimate terrain rugosity. The controller adjusts the vertical body wave based on the deviation of the limb's averaged actual-to-ideal foot-ground contact ratio, achieving a significant enhancement of up to 0.235 BLC on rugose laboratory terrain. We observed a $\sim$ 50\% increase in speed and a $\sim$ 40\% reduction in speed variance compared to the open-loop controller. Additionally, the controller operates in complex terrains outside the lab, including pine straw, robot-sized rocks, mud, and leaves.
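A toy sketch of the feedback rule described above, with assumed parameters: binary contact readings yield an averaged actual-to-ideal contact ratio per gait cycle, and the vertical body-wave amplitude is adjusted in proportion to the deviation. The ideal ratio, gain, and bounds are illustrative, not the paper's values.

```python
# Toy contact-ratio feedback on the vertical body-wave amplitude.
def update_wave_amplitude(amplitude, contacts, ideal_ratio=0.5,
                          gain=0.2, a_min=0.0, a_max=1.0):
    actual_ratio = sum(contacts) / len(contacts)  # averaged over a gait cycle
    error = ideal_ratio - actual_ratio            # rugosity -> missed contacts
    amplitude += gain * error                     # deepen wave on lost contact
    return min(max(amplitude, a_min), a_max)

# One cycle on rough terrain: several feet missed ground contact.
amp = update_wave_amplitude(0.4, contacts=[1, 0, 1, 0, 0, 1, 0, 0])
print(round(amp, 3))  # 0.425
```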
https://arxiv.org/abs/2411.07183
Understanding and manipulating the causal generation mechanisms in language models is essential for controlling their behavior. Previous work has primarily relied on techniques such as representation surgery -- e.g., model ablations or manipulation of linear subspaces tied to specific concepts -- to intervene on these models. To understand the impact of interventions precisely, it is useful to examine counterfactuals -- e.g., how a given sentence would have appeared had it been generated by the model following a specific intervention. We highlight that counterfactual reasoning is conceptually distinct from interventions, as articulated in Pearl's causal hierarchy. Based on this observation, we propose a framework for generating true string counterfactuals by reformulating language models as Generalized Structural-equation Models using the Gumbel-max trick. This allows us to model the joint distribution over original strings and their counterfactuals resulting from the same instantiation of the sampling noise. We develop an algorithm based on hindsight Gumbel sampling that allows us to infer the latent noise variables and generate counterfactuals of observed strings. Our experiments demonstrate that the approach produces meaningful counterfactuals while at the same time showing that commonly used intervention techniques have considerable undesired side effects.
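A single-token sketch of the underlying Gumbel-max reformulation: sampling is the argmax of logits plus Gumbel noise, so fixing the noise and re-running generation under intervened logits yields the counterfactual for the same randomness (hindsight Gumbel sampling infers this noise from an observed string). The toy vocabulary and intervention are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)
vocab = ["the", "cat", "dog", "sat"]
gumbel = -np.log(-np.log(rng.uniform(size=len(vocab))))  # shared noise U

logits_original = np.array([0.5, 2.0, 0.1, 0.3])
logits_intervened = logits_original.copy()
logits_intervened[1] -= 3.0  # intervention: suppress "cat"

factual = vocab[int(np.argmax(logits_original + gumbel))]
counterfactual = vocab[int(np.argmax(logits_intervened + gumbel))]
print(factual, "->", counterfactual)
```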
https://arxiv.org/abs/2411.07180