Episodes

  • EDINET-Bench: LLMs on Japanese Financial Tasks
    Jun 24 2025

    The article introduces EDINET-Bench, an open-source Japanese financial benchmark designed to evaluate Large Language Models (LLMs) on complex financial tasks. The benchmark addresses the scarcity of challenging Japanese financial datasets for LLM evaluation and covers tasks such as accounting fraud detection, earnings forecasting, and industry prediction. The EDINET-Bench dataset is automatically compiled from ten years of Japanese annual reports available through the Electronic Disclosure for Investors’ NETwork (EDINET). Initial evaluations indicate that even state-of-the-art LLMs perform only marginally better than logistic regression in some complex financial tasks, highlighting the need for domain-specific adaptation and further research. The project makes its dataset, benchmark construction code, and evaluation code publicly available to foster advancements in LLM applications within the financial sector.

    44 m
  • AutoThink: Efficient LLM Reasoning with Adaptive Budgeting
    Jun 4 2025

    The article introduces AutoThink, an innovative approach designed to enhance the inference efficiency and accuracy of reasoning Large Language Models (LLMs). AutoThink addresses the challenge of LLMs generating excessive or insufficient reasoning tokens, which leads to computational inefficiency and suboptimal performance. This system comprises two main components: a query complexity classifier that dynamically allocates the optimal number of reasoning tokens, and a dataset of control vectors derived from "pivotal tokens" to guide the LLM's reasoning path. Experimental results demonstrate that AutoThink significantly reduces output tokens while substantially improving accuracy on complex reasoning tasks, suggesting a more strategic approach to LLM resource allocation rather than simply increasing computation.

    14 m
  • System Prompt Learning for LLM Problem-Solving Strategies
    Jun 4 2025

    The article introduces System Prompt Learning (SPL), an innovative approach enabling Large Language Models (LLMs) to learn and refine problem-solving strategies through practical experience. This method addresses the current disparity where most developers lack the sophisticated system prompts that make advanced AI assistants so capable. SPL represents a "third paradigm" of LLM learning, augmenting traditional pretraining and finetuning by allowing models to classify problems, apply relevant strategies, and continuously improve these strategies over time. The system maintains a dynamic database of human-readable strategies, demonstrating significant performance improvements across various benchmarks and offering benefits like cumulative learning, transparency, and adaptability. Implemented as an open-source plugin in optillm, SPL offers a practical way to integrate this adaptive intelligence into LLM applications.

    16 m
  • OpenEvolve: Open Source AlphaEvolve Implementation
    May 21 2025

    This article introduces OpenEvolve, an open-source implementation of Google DeepMind's AlphaEvolve, a system that leverages Large Language Models (LLMs) in an evolutionary framework to generate and optimize code. OpenEvolve allows users to evolve entire codebases by iteratively creating modifications using LLMs, evaluating them with automated metrics, and selecting promising solutions through an evolutionary process. The article details OpenEvolve's architecture, highlighting key components such as the Prompt Sampler and LLM Ensemble. It provides examples in which OpenEvolve achieves results comparable to AlphaEvolve on complex problems such as circle packing and function minimization, evolving from simpler algorithms to more sophisticated solutions. It also discusses the importance of LLM performance and diversity for successful evolution and gives guidance on installing and using the software to develop and improve algorithms.

    25 m
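
    The propose, evaluate, select loop described for OpenEvolve can be illustrated with a toy sketch. This is not OpenEvolve's code or API: `evolve`, `gaussian_step`, and `negative_loss` are hypothetical names, a random numeric perturbation stands in for the LLM-generated code modification, and a single float stands in for a program.

```python
import random
from typing import Callable, List


def evolve(
    initial: float,
    propose_change: Callable[[float, random.Random], float],
    score: Callable[[float], float],
    generations: int = 200,
    population_size: int = 8,
    seed: int = 0,
) -> float:
    """Toy evolutionary loop: propose a change, score it, keep the best candidates."""
    rng = random.Random(seed)
    population: List[float] = [initial]
    for _ in range(generations):
        # Pick a parent from the surviving candidates.
        parent = rng.choice(population)
        # Hypothetical stand-in for "ask an LLM for a modification".
        child = propose_change(parent, rng)
        population.append(child)
        # Selection: keep only the highest-scoring candidates.
        population.sort(key=score, reverse=True)
        del population[population_size:]
    return population[0]


def gaussian_step(x: float, rng: random.Random) -> float:
    """Assumed mutation operator: a small random perturbation."""
    return x + rng.gauss(0.0, 0.5)


def negative_loss(x: float) -> float:
    """Assumed automated metric: higher is better, maximum at x = 3."""
    return -((x - 3.0) ** 2)


best = evolve(initial=0.0, propose_change=gaussian_step, score=negative_loss)
print(best, negative_loss(best))
```

    Because the top-scoring candidate always survives selection, the best score never decreases from one generation to the next.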
  • PTS: Pivotal Token Search
    May 18 2025

    This paper introduces Pivotal Token Search (PTS), a novel method for improving the performance of large language models by focusing on critical decision points in their output sequences. Unlike traditional methods that treat all generated tokens equally, PTS identifies "pivotal tokens" that significantly influence the probability of a successful generation. By using a binary search algorithm to pinpoint these key tokens, PTS generates preference pairs specifically centered on these critical decisions, leading to a more efficient learning signal during training. The release includes an open-source implementation, datasets of pivotal tokens and preference pairs, and fine-tuned models demonstrating the technique's effectiveness. This approach has potential applications in improving reasoning abilities, agent trajectories, and model interpretability.

    11 m
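
    The binary search for pivotal tokens mentioned above can be sketched as a recursive subdivision. This is a minimal illustration, not the released implementation: `find_pivotal_tokens` is a hypothetical name, and `success_prob` is an assumed callback returning the estimated probability of a successful generation given the first `n` tokens (in practice such an estimate would come from sampling completions).

```python
from typing import Callable, List, Tuple


def find_pivotal_tokens(
    tokens: List[str],
    success_prob: Callable[[int], float],
    threshold: float = 0.2,
) -> List[Tuple[int, str, float]]:
    """Return (index, token, delta) for single tokens that shift the
    estimated success probability by at least `threshold`."""
    pivotal: List[Tuple[int, str, float]] = []

    def search(lo: int, hi: int, p_lo: float, p_hi: float) -> None:
        # tokens[lo:hi] moves the success probability from p_lo to p_hi.
        if abs(p_hi - p_lo) < threshold:
            return  # No large shift across this segment; stop splitting.
        if hi - lo == 1:
            pivotal.append((lo, tokens[lo], p_hi - p_lo))
            return
        mid = (lo + hi) // 2
        p_mid = success_prob(mid)
        search(lo, mid, p_lo, p_mid)
        search(mid, hi, p_mid, p_hi)

    search(0, len(tokens), success_prob(0), success_prob(len(tokens)))
    return pivotal


# Made-up probabilities indexed by prefix length; the jump occurs at token "c".
example_tokens = ["a", "b", "c", "d", "e", "f", "g", "h"]
example_probs = [0.5, 0.5, 0.5, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]
result = find_pivotal_tokens(example_tokens, lambda n: example_probs[n])
print(result)
```

    Note that this simplified version prunes any segment whose endpoints differ by less than the threshold, so it can miss shifts that cancel out inside a segment.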
  • CameraBench: Understanding Video Motion
    Apr 28 2025

    This episode introduces CameraBench, a large-scale dataset and benchmark designed to improve camera motion understanding in videos. It details a taxonomy of camera motion primitives developed with cinematographers, highlighting how motions can relate to scene content like tracking subjects. The authors describe a rigorous annotation framework and human study demonstrating how domain expertise and training enhance annotation accuracy. Using CameraBench, they evaluate both Structure-from-Motion (SfM) and Video-Language Models (VLMs), finding that SfM struggles with semantic primitives while VLMs struggle with precise geometric motions. Finally, they show that fine-tuning a generative VLM on CameraBench significantly improves performance on tasks like motion-augmented captioning and video question answering.

    15 m
  • Step1X-Edit: General Image Editing Framework
    Apr 25 2025

    This episode introduces Step1X-Edit, an open-source image editing model designed to close the performance gap with proprietary models like GPT-4o. The developers created a large-scale, high-quality dataset and a new benchmark (GEdit-Bench) reflecting real-world editing instructions to train and evaluate the model. Step1X-Edit integrates a Multimodal Large Language Model (MLLM) with a diffusion-based image decoder to perform diverse edits based on natural language instructions. Experimental results indicate that Step1X-Edit outperforms existing open-source models and achieves performance comparable to leading closed-source systems.

    21 m