Author | Lightnews

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

MemOS releases its OpenClaw Plugin, offering a shared memory layer for OpenClaw teams to reduce token costs and maintain consistent agent context.

MemOS OpenClaw Plugin to cut agent memory costs by 70%

MemOS has shipped its OpenClaw Plugin, and it is now live as a drop-in memory layer for teams building with OpenClaw. The promise is blunt: keep long-term context without blowing up token bills, while keeping agent personalization consistent across longer projects. SPONSORED Explore MemOS OpenClaw Plugin to enable multiple AI Agents operate your memory. Check Github According to MemOS benchmarks, the plugin can cut token usage by roughly 60 to 70 percent versus native OpenClaw memory flows, by shifting what gets stored and recalled into a dedicated memory layer instead of repeatedly reloading huge context windows. That matters most when agents run daily, handle multi-step tasks, or sit inside paid products where every extra token is a real cost. Multi-agent collaboration is having a moment. Whether it's AutoGen, CrewAI, or the recently viral OpenClaw, everyone's exploring how to get multiple agents working together. But there's a catch: each agent carries its own isolated "brain," with no idea what the others are doing. The result? Duplicated work, mismatched context, and information handoff via manual copy-paste. > 🧠 From context stacking → system memory > Memory is no longer shoved into prompts. > It’s structured, schedulable state. > > 📉 72%+ fewer tokens, 60% fewer model recalls > No more today+yesterday+everything injection. > Only task-relevant recall, on demand. > > 🎯 +33% accuracy on LOCOMO… > > — MemOS (@MemOS_dev) February 10, 2026 MemOS Plugin addresses exactly this. It enables multiple OpenClaw agents to share the same memory pool, instead of each agent maintaining isolated memory, the entire team writes to and reads from a unified space. What Agent A produces, Agent B can directly access, without you shuttling information back and forth. This ensures that collaboration does not collapse into duplicated work or mismatched context. 0:00 /0:15 1× MemOS visualisation 👀 The intended audience is clear: B2B agent builders, dev-tools teams, internal copilots, and anyone shipping agent workflows where memory becomes the bottleneck for cost and consistency. Availability is straightforward: the plugin is distributed via GitHub and is meant to plug into OpenClaw wherever you run it. This lands as memory tooling becomes a battleground for agent stacks, alongside products like mem0, supermemory, and memU, with MemOS pushing the angle that memory should be treated as its own OS layer rather than a bolt-on prompt trick. MemOS is the project behind the plugin, positioned as a “memory OS” for AI apps and agents, with its own site, dashboard, and a broader open source footprint under the MemTensor org. This plugin is the latest move in that direction: push memory into a reusable layer that can be shared, persisted, and reused across agents and sessions, so long-running workflows do not keep paying the same context tax over and over.

www.testingcatalog.com

February 12, 2026 at 2:01 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

Google NotebookLM is testing a visual style selector for infographics, offering 10 distinct style options for tailored presentations.

Google adds 10 customizable infographic styles to NotebookLM

Google has been updating NotebookLM with more flexible tools for generating infographics. In the most recent builds, a significant update was discovered: users will soon be able to choose from a new visual style selector for infographics, with a total of ten distinct options. These options include an auto-selection mode and 9 specific styles: sketch, kawaii, professional, anime, 3D clay, editorial, storyboard, bento grid, and bricks. Each style presents a unique visual approach, enabling users to adapt the appearance of infographics to better match their intended audience or platform. For example, sketch and kawaii styles offer a more playful presentation suitable for informal channels or younger audiences, while professional, editorial, and bento grid styles are designed for more structured use cases such as LinkedIn or internal presentations. The inclusion of anime and 3D clay options allows for even more creative flexibility, appealing to content creators looking for distinctive visuals. Initial access to these styles reveals that all of them are functional and offer considerable customization. The ability to fine-tune infographic visuals according to personal or brand preferences could help NotebookLM expand its relevance for professionals and educators who rely on polished or stylized visuals to communicate information. The auto-selection mode provides a default experience, but users wanting more control can quickly switch between modes as needed. Anime style example This update is in line with Google’s strategy to position NotebookLM as a versatile tool for both productivity and creative work, leveraging AI to simplify content production while still allowing for a degree of personalization. Google continues to iterate on NotebookLM’s AI-driven features to make it useful across different industries and content workflows, and the addition of customizable infographic styles fits into this vision. Given that the feature is already working in current builds, there is a reasonable expectation that it could become generally available soon, although the precise timeline remains unknown.

www.testingcatalog.com

February 11, 2026 at 8:53 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

Anthropic is testing a Tasks feature in Claude’s mobile apps, bringing Cowork-style automation, repeatable actions, and possible browser tasks soon.

Anthropic prepares Claude Tasks on mobile for browser automation

Anthropic appears to be preparing “Tasks” inside Claude’s mobile apps. In a recent iOS build, new UI traces point to a Tasks entry in the app menu and a dedicated Tasks page where users could create new items, suggesting the feature is moving beyond desktop and into the phone-first workflow many Claude users rely on. What’s visible so far looks closely aligned with the existing Claude Cowork interface: similar naming, iconography, and an emphasis on setting up repeatable actions rather than one-off prompts. If this ships as implied, it would effectively bring Cowork-style automation to iOS, and likely Android next, letting users set up structured jobs from the same place they already chat. > Anthropic is working on Tasks mode for Claude mobile apps. > > Mobile Cowork is coming 👀 pic.twitter.com/lDkQzpZ9fs > > — TestingCatalog News 🗞 (@testingcatalog) February 9, 2026 The strings also hint at broader capabilities attached to Tasks, including the ability for Claude to operate a browser as part of execution. On mobile, that would imply a workflow where a task can open pages, gather information, and complete steps in sequence, without the user manually driving every tap. Timing remains unclear. Anthropic has been expanding Claude’s “agentic” surface area quickly across platforms, and a mobile rollout would be consistent with turning Cowork into a cross-device capability rather than a desktop-only feature. If this lands soon, the most likely beneficiaries are power users and teams who already use Claude for recurring operational work, along with creators and professionals who want lightweight automation from a phone. It also sets up a platform race dynamic with other agent-style products on mobile, including the still-anticipated iOS arrival of Comet, where “who ships first” will shape mindshare even if the long-term capability sets converge.

www.testingcatalog.com

February 11, 2026 at 4:00 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

OpenAI’s updated Deep Research in ChatGPT with GPT-5.2, and is working on a new Skills section for ChatGPT to install and edit SKILLS.

OpenAI works on ChatGPT Skills, upgrades Deep Research

OpenAI has introduced a revamped Deep Research experience in ChatGPT, transitioning it from a "run it and wait" flow to a more interactive guided research session. Users can now constrain Deep Research to specific websites, incorporate context from connected apps, and intervene during the process to add requirements or redirect the work. The output has also been enhanced, with reports designed for review in a dedicated full-screen view, making long, citation-heavy writeups less cramped when skimming sections or checking sources. > Deep research in ChatGPT is now powered by GPT-5.2. > > Rolling out starting today with more improvements. pic.twitter.com/LdgoWlucuE > > — OpenAI (@OpenAI) February 10, 2026 This update is particularly beneficial for individuals engaged in recurring source-based work, such as analysts, founders, journalists, marketers, and researchers who prioritize reproducibility and scope control. Website-limited research addresses the "too broad" issue when users already know which domains they trust. Connectors are useful when the missing piece is within the user's workflow, such as email, calendars, documents, or other internal contexts that the model would not otherwise access. The ability to interrupt mid-run is crucial for iterative tasks, allowing users to pivot the report without restarting from scratch when the first batch of sources reveals a better angle. Behind the scenes, OpenAI is aligning this feature with its latest flagship model line by moving the Deep Research backend to GPT-5.2. This aligns with OpenAI’s current product strategy, which emphasizes agent-like workflows that integrate browsing, synthesis, and tool access, rather than treating the chatbot as a single-shot answer box. Concurrently, there is growing anticipation around the arrival of GPT-5.3, following the recent release of GPT-5.3-Codex on the coding side. However, it remains unclear when a general ChatGPT-facing GPT-5.3 will be available and whether it will immediately replace GPT-5.2 in Deep Research. > Woah! ChatGPT will add support for importing skills to your library > > I just had it create a skill for me that I could use in Codex and got this popup in the chat pic.twitter.com/8AEUfKgjcD > > — Max Weinbach (@mweinbach) February 10, 2026 In addition to the Deep Research upgrades, there are indications that ChatGPT may be preparing to introduce a first-party "Skills" layer. This would involve installable, editable workflow instructions that shape how the assistant behaves for specific tasks. The concept is reminiscent of agent frameworks and development tools, where a skill packages a repeatable procedure, constraints, and expected outputs, allowing the model to execute a known playbook instead of reinventing the approach each time. If OpenAI integrates skill management directly into ChatGPT, it would provide power users and teams with a native way to standardize workflows, share internal operating procedures, and maintain consistent results across individuals and projects without the need to build a full custom agent stack. While the timing remains uncertain, this direction aligns with OpenAI’s broader move toward configurable, tool-using assistants that are more closely integrated with real work.

www.testingcatalog.com

February 11, 2026 at 3:50 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

Perplexity is testing a Health section that may offer personalised advice, settings-based profiles, and possible Apple Health integration.

Perplexity tests Health page with Apple Health integration

Perplexity is preparing to launch a new **Health** section, expanding its domain-focused offering beyond current categories such as Finance, Travel, and Sports. The upcoming Health tab is expected to appear as a dedicated module within the main navigation, providing users with streamlined access to health-related tools and information. This approach follows the pattern established in other verticals, where users can easily switch between specialized modules via the top navigation bar. The Health module is designed to collect user-specific details through a profile system. In this area, users will be able to specify health goals, report their activity level, list medical conditions, and enter family medical history, among other categories. The profile management will likely include an edit function, allowing users to update their information as their circumstances change. This level of customization is aimed at tailoring the responses and recommendations to the user’s specific context, which could be particularly valuable for individuals tracking ongoing health and wellness goals or those managing chronic conditions. A notable addition being developed for Perplexity Health is the option to connect external data sources, with Apple Health integration specifically mentioned. This feature would allow users to import activity, biometrics, and potentially other health metrics directly into Perplexity’s platform, centralizing information from multiple devices or apps. Within the Health module, users will be presented with a dashboard designed to visualize this connected data, offering an at-a-glance overview of trends and statistics sourced from various inputs. This dashboard concept mirrors what is available in fitness and health tracking apps, potentially increasing the value for those who already rely on Apple Health or similar services. Initial indications suggest that the Health module **might launch first for users in the United States** , which is a common practice for new features that involve regulatory or privacy considerations tied to health data. The actual release timeline is still unconfirmed, but the presence of settings screens, profile categories, and integration touchpoints suggests the feature is well into development. Once available, the new Health tab will likely appeal to users looking to consolidate their health information and receive more context-aware answers or recommendations within the Perplexity platform.

www.testingcatalog.com

February 11, 2026 at 2:58 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

What's new? Agent Swarm coordinates 100 sub-agents to execute 1500 tool calls at 4.5x single-agent speeds; it is offered on Kimi's platform as a research preview;

Kimi launches Agent Swarm AI for parallel research and analysis

Kimi has unveiled Agent Swarm, a self-organizing AI system that goes beyond the traditional single-agent approach. Rather than relying on one model to process tasks sequentially, Agent Swarm creates an internal organization, autonomously assembling and managing up to 100 specialized sub-agents in parallel for research, analysis, or content generation. This allows it to execute over 1,500 tool calls and deliver results at speeds up to 4.5 times faster than single-agent systems. The feature is currently offered as an early research preview, with continued development planned to enable direct communication between sub-agents and dynamic control over task division. > Kimi Agent Swarm blog is here 🐝 https://t.co/XjPeoRVNxG > > Kimi can spawn a team of specialists to: > > - Scale output: multi-file generation (Word, Excel, PDFs, slides) > - Scale research: parallel analysis of news from 2000–2025 > - Scale creativity: a book in 20 writing styles… pic.twitter.com/ElTzf3ksQe > > — Kimi.ai (@Kimi_Moonshot) February 10, 2026 Agent Swarm is designed for users with demanding workloads: researchers, analysts, writers, and professionals needing large-scale data gathering, document synthesis, or complex problem-solving from multiple perspectives. The system operates on Kimi’s platform, accessible to users through their web interface, and is not limited to a specific geographic region. Users can instruct the system to form expert teams for broad research, generate lengthy academic reports, or analyze problems from conflicting viewpoints, all without manual intervention. Kimi, the company behind Agent Swarm, has focused on pushing the boundaries of AI utility by addressing the bottlenecks of single-agent reasoning and vertical scaling. Their approach with Agent Swarm marks a shift toward horizontal scaling, enabling many agents to collaborate and self-organize, positioning Kimi as a pioneer in the practical deployment of multi-agent AI architectures. Source

www.testingcatalog.com

February 10, 2026 at 10:39 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

What's new? Telegram updated its Android, iOS and iPad apps with a redesigned look, bottom bar, media viewer and shortcut; gift crafting, group transfer and bot button color options added;

Telegram revamps app with new interface and craftable gifts

Telegram has rolled out a major update for its Android app, introducing a fully redesigned interface. This update brings a new bottom bar for swift navigation between chats, settings, and profiles, making it easier for users to access core features. The development team has rebuilt the interface code to maximize efficiency and responsiveness, while users can control interface effects via Power Saving settings to extend battery life. For iOS users, the update introduces a revamped media viewer, improved sticker and emoji pack previews, and streamlined context menus. iPad users benefit from a new keyboard shortcut for sending messages. 0:00 /0:14 1× A key addition is the crafting system for collectible gifts, allowing users to combine up to four gifts to create higher-tier items with rare attributes and unique visuals. The crafting process uses probability mechanics, where the inclusion of similar attributes increases the likelihood of those traits appearing in the final result. All users can access this feature and participate in buying or selling collectible gifts through Telegram’s Gift Marketplace. Telegram’s update also updates group management by enabling group owners to: 1. Instantly assign a new owner when leaving. 2. Have ownership automatically transfer to an admin after a week. Bot developers now have the option to customize buttons with colors and emojis for clearer user actions. Telegram, known for its privacy features and large-scale group chats, continues to target both casual users and power users who value customization, security, and feature depth. This release is publicly available across all supported platforms and reflects Telegram’s ongoing efforts to refine usability and expand its digital marketplace offerings. Source

www.testingcatalog.com

February 10, 2026 at 10:36 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

OpenAI is testing sponsored placements in ChatGPT for U.S. users on Free and Go tiers, with privacy rules, user controls, and clear ad labeling.

OpenAI tests sponsored ads in ChatGPT for free US users

OpenAI has started testing sponsored placements inside ChatGPT for logged-in adult users in the U.S., limited to the Free and Go tiers. Plus, Pro, Business, Enterprise, and Education users will not see ads, positioning the rollout as a funding lever aimed at keeping lower-cost access viable while preserving trust in the assistant for personal and work tasks. > We’re starting to roll out a test for ads in ChatGPT today to a subset of free and Go users in the U.S. > > Ads do not influence ChatGPT’s answers. Ads are labeled as sponsored and visually separate from the response. > > Our goal is to give everyone access to ChatGPT for free with… pic.twitter.com/S9BV24uJLb > > — OpenAI (@OpenAI) February 9, 2026 The company says ads do not change ChatGPT’s answers and will appear as clearly labeled sponsored units that are visually separated from the organic response. During the test, ad selection is based on matching advertiser submissions to what you are discussing, plus signals like your past chats and prior ad activity, with the first slot going to the most relevant available advertiser. 0:00 /0:19 1× OpenAI frames privacy as a hard boundary: advertisers do not get access to chats, chat history, memories, or personal details, and only receive aggregated performance data such as views and clicks. Safeguards include not showing ads for accounts where OpenAI is told, or predicts, the user is under 18, and blocking ads near sensitive or regulated topics such as health, mental health, or politics. Users get controls to dismiss ads, provide feedback, see why an ad is shown, delete ad data, and manage personalization. If you do not want ads, OpenAI points to upgrading tiers, or opting out on Free in exchange for fewer daily free messages. Source

www.testingcatalog.com

February 10, 2026 at 12:54 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

What's new? Composer 1.5 uses 20x more RL steps and a thinking tokens system for code reasoning; it applies self summarization to manage long context lengths;

Cursor launches Composer 1.5 with upgrades for complex tasks

Composer 1.5, the latest agentic coding model from the team at Cursor, introduces several updates that set it apart from its predecessor, Composer 1. The model targets software developers, coding professionals, and organizations seeking automated code generation and reasoning tools. Composer 1.5 is now available for public use, with information about its pricing accessible on Cursor's official documentation. > Composer 1.5 is now available. > > We’ve found it to strike a strong balance between intelligence and speed. pic.twitter.com/jK92KCL5ku > > — Cursor (@cursor_ai) February 9, 2026 This release features a substantial increase in reinforcement learning scale, being trained with 20 times more RL steps than before. Technical upgrades include: 1. Improved handling of complex coding tasks. 2. A new system for generating 'thinking tokens' that enable the model to plan and reason through problems. 3. An advanced self-summarization capability, allowing Composer 1.5 to manage longer context lengths by recursively summarizing its own process to maintain accuracy even when memory becomes constrained. Compared to previous versions, Composer 1.5 demonstrates sharper performance, especially on difficult or multi-step coding challenges. Cursor, the company behind Composer, has focused on applying reinforcement learning at scale to coding models, aiming for continuous and predictable gains in problem-solving ability. The company positions Composer 1.5 as a daily-use tool, balancing quick response times for simple tasks while deploying deeper reasoning for more challenging code issues. Early user feedback within developer forums has noted improvements in both speed and the ability to tackle more intricate programming scenarios. Source

www.testingcatalog.com

February 10, 2026 at 12:30 AM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

Meta AI is testing Avocado models, MCP integrations, and Manus browser agent support, with scheduled tasks and OpenClaw compatibility launching soon.

Meta AI redies Avacado, Manus Agent and OpenClaw integration

Meta AI is reportedly preparing to release new models named **Avocado**. Recently, Meta refreshed its website and shipped an app update where users have seen a new effort selector to choose between Fast and Thinking modes. On the web, some users have also spotted a new widget prompting them to connect apps like Gmail and Google Calendar. Notably, Microsoft Outlook and Outlook Calendar appear as options too. This looks similar to connectors in other apps: integrations that would let Meta AI pull information and operate tools via MCPs. If that reading is correct, it would mean MCP support is finally coming to Meta AI. A **Memory** section has been added to the settings menu as well, and Meta AI users should already be able to see and test all these features. What’s also notable is that Meta seems to have revamped, or possibly rebuilt, the website, and the new build appears to include a lot of additional functionality. First, as we know, Meta acquired Manus AI recently, and there are mentions of a Manus AI agent and a browser agent being in the works. That suggests Manus-style agents could come directly to Meta AI. There is also a new menu in development called Tasks, where users would be able to schedule recurring executions of Meta AI, similar to scheduled prompts in other tools. In other words, scheduled tasks seem to be on the roadmap, letting users run prompts recurrently. Code traces also suggest they are working on voice agent support. These voice agent experiences appear to reference a previous implementation of Meta AI agents. Interestingly, for testing, they were using the personality of Mark Zuckerberg. However, it also looks like the voice and browser agent features are not at a final stage of implementation yet. Another detail tied to the Manus AI integration is that Meta AI appears to be testing top models from other labs, including Gemini, ChatGPT, and Claude. These models reportedly show up in the code and are being used internally for testing. Now to the more interesting part: there seem to be multiple internal modes used for development and testing. Beyond Fast and Thinking, there are traces of a new Avocado model, shown in two forms: **Avocado** and **Avocado Thinking**. Only Avocado is responding currently. The responses so far are not great, but it’s unclear whether these answers are coming from an existing model via routing or from the actual new model. If it is the new model, then Meta would be in a very bad position and should not release it. It’s also unclear whether Meta is preparing to release these models around February. That seems plausible given that the revamped UI has already shipped, and the remaining step could be powering it with the new Avocado model. Referencing the Manus browser agent, a model called **Sierra** appears to represent the browser agent. That makes it likely we’ll see it shipped soon, possibly at the same time as the rest of the models. Overall, Meta AI seems to be aiming to rebrand and expand the experience to close the feature gap with competitors, and browser agents could be part of that. Another model referenced is **Big Brain**. This does not necessarily look new, since Meta previously had plans to implement something similar last year alongside Llama models. Conceptually, it resembles Grok Heavy: multiple model agents run in parallel, and the best output is selected as the response. If the upcoming Avocado model is actually good and this mode is powered by it, that could be a meaningful capability. Beyond models, there are also test placeholders for UX called RUX Playground. These are likely used to test widget responses and UI layouts, especially since Meta AI appears to be building card-like UI elements similar to other chatbots (for example, weather or stock-market cards). Meta products already support web search, and there also appears to be a shopping assistant in development. It’s not functional yet, but it’s evident that Meta AI is working on a shopping experience. That could be significant given Meta’s position across Facebook and Instagram, where people already buy and sell products. Finally, Meta AI appears to be working on something close to an OpenClaw integration. In particular, this mode would allow you to use any model with your own API key, a bring-your-own-key experience, potentially living inside app connectors. Across the code, it’s referenced as an OpenClaw agent. That could be a big deal given Meta’s history of open source, even if they no longer plan to open-source their proprietary Avocado model. It may also indicate they are preparing something for the open-source community, such as letting people power Meta AI with their own models, or offering tighter integration with OpenClaw bots that are currently growing quickly. It’s still unclear whether we’ll see any Super Bowl ads from Meta today, and when exactly these new experiences and the Avocado model will ship. Still, there’s a high chance it happens very soon. Considering it was recently reported that Avocado was the best model among current top models, it might have strong chances. At the same time, we just got Opus 4.6 and GPT 5.3 Codex, so it’s possible that upcoming releases could overshadow what Meta has lined up. We’ll see!

www.testingcatalog.com

February 8, 2026 at 2:40 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

Notion is testing new agent features, including a redesigned settings UI, new automation triggers, and upcoming Agents 2.0 upgrades.

Notion tests Agents 2.0 with scripting tools and Workers

Notion’s AI agents are experiencing a steady stream of updates, with several changes emerging since last month across the agent setup flow and surrounding configuration. One noticeable change is UI-related: the settings layout for custom agents has transitioned from a full page to a side sheet. Additionally, Notion seems to be preparing a second iteration of its agents, featuring two experimental toggles labeled “Agents 2.0” and “Agents 2.0 Advanced.” Both are marked as experimental, and the wording suggests they may be linked to more compute power or stronger underlying models if and when they are rolled out. The same area also indicates functional expansion through triggers and automation. The triggers section now includes a Slack option, allowing an agent to be invoked when a message is posted in a channel, hinting at a deeper Slack integration than before. In the settings, there is also a new **scripting configuration** for agents, with fields for a script name, a key, and script code. The intent seems to be enabling agents to call into “Workers” as a capability when needed, rather than being confined to chat-style actions. A related “Workers” section references an NPM package and a place to manage automations, including templates that connect external signals to Notion actions. Examples mentioned include: 1. Creating a database when a connected GitHub account stars a repository. 2. Posting Slack messages when tasks pass a deadline. 3. Wiring actions into email and calendar flows. If these features move beyond the experimental phase, they would primarily benefit teams already using Notion as an operational hub, especially small businesses and enterprise groups that want agents to react to events across Slack, GitHub, and scheduling tools directly from the agent configuration surface.

www.testingcatalog.com

February 8, 2026 at 10:40 AM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

Google is testing a “Personal intelligence” layer in NotebookLM, hinting at personalized responses and custom prompts based on chat history.

Google tests Personal Intelligence for NotebookLM conversations

www.testingcatalog.com

February 7, 2026 at 4:17 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

Google is testing a new Gemini checkpoint in AI Studio, impressing early users with precise UI generation and SVG output, but release details remain unclear.

Exclusive: A new Gemini 3 Pro checkpoint spotted in A/B testing

www.testingcatalog.com

February 7, 2026 at 3:47 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

What's new? Max is a model router that auto-selects the best language model from over 5m community votes; the latency-aware variant Arcstride cuts first-token delay by 16 sec.

ICYMI: Arena launches Max router to boost AI prompt accuracy

<p>Arena has just launched Max, a model router designed to automatically select the most suitable language model for every user prompt by leveraging over five million real-world community votes. Max is available to the public on the Arena platform and can be accessed by anyone interested in AI-driven conversations. The feature is targeted at users who demand versatile and top-performing language model responses, including developers, researchers, and businesses seeking robust AI outputs across coding, math, creative writing, and more.</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">Have you met Max? Live on Arena.<br /><br />Powered by 5M+ real-world community votes, Max intelligently routes each prompt to the most capable model with latency in mind.<br /><br />You get more reliable results across real use cases, without having to choose.<br /><br />Here’s a quick clip with Arena… <a href="https://t.co/o3x5X5c5PZ">pic.twitter.com/o3x5X5c5PZ</a></p>— Arena.ai (@arena) <a href="https://twitter.com/arena/status/2019460554436620689?ref_src=twsrc%5Etfw">February 5, 2026</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></figure><p>Max operates as an orchestration layer, dynamically routing prompts to leading models such as Claude Opus, Gemini 3 Pro, and Grok 4.1 Thinking, according to each prompt’s demands. The system currently outperforms individual models, topping the Arena leaderboard with an overall score of 1500, and leading in categories like Coding, Math, and Expert tasks. A latency-aware variant codenamed "arcstride" maintains high performance while reducing first-token latency by over 16 seconds compared to the next fastest model, addressing real-time application requirements.</p><p>Arena’s approach distinguishes Max from previous versions and competitors by combining the strengths of several top-tier LLMs into a single, seamless user experience. Early performance data from benchmarks such as HLE, GPQA Diamond, and MMLU-Pro indicate Max competes closely with leading models on accuracy while maintaining superior response times. Industry observers note the router’s flexibility and speed set a new bar for multi-model orchestration.</p><p><a href="https://arena.ai/blog/introducing-max/">Source</a></p>

www.testingcatalog.com

February 7, 2026 at 1:32 AM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

What's new? ai.com launches a consumer-focused autonomous ai agent platform that lets users generate agents for work and messaging tasks; public launch scheduled february 8 2026 with Super Bowl LX commercial on NBC;

ai.com to release decentralised AI agent platform during Super Bowl

<p>ai.com, led by Kris Marszalek, is introducing a consumer-focused Autonomous AI agent platform. This offering allows users to generate private AI agents capable of taking actions such as organizing work, sending messages, handling app integrations, and executing various tasks on their behalf. Unlike previous AI assistants, these agents can autonomously develop new features to complete assigned tasks and share these improvements across the user network, significantly expanding their capabilities for all users. The official public launch is scheduled for February 8, 2026, aligning with a high-profile commercial during Super Bowl LX on NBC.</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">I purchased <a href="https://t.co/ac2AqjBNxj">https://t.co/ac2AqjBNxj</a> in April. Since that time, we created a team that has been steadily building. There are always twists and turns, but I’m excited with our first launch this Sunday during the Super Bowl. <a href="https://t.co/BbqVo1bQLZ">pic.twitter.com/BbqVo1bQLZ</a></p>— Kris (@kris) <a href="https://twitter.com/kris/status/2019776790919815611?ref_src=twsrc%5Etfw">February 6, 2026</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></figure><p>The platform targets everyday consumers who want powerful, hands-off assistance without needing technical skills or specialized hardware. Users can start for free, with options for paid tiers that offer extended capabilities and higher input limits. Agents operate in isolated, encrypted environments, ensuring user privacy and control. The network is built for global reach, accessible via ai.com, and is designed to work across multiple platforms and applications.</p><p>ai.com’s approach aims to move beyond simple chatbots by enabling agents to carry out complex sequences and adapt autonomously. This distinguishes it from existing personal AI tools, which often require manual setup or lack cross-app autonomy. Marszalek, known for building Crypto.com into a major crypto exchange, is applying his expertise to position ai.com at the forefront of consumer AI adoption, with future plans including financial integrations and marketplace features.</p><p><a href="https://ai.com/company-news/ai-com-launch">Source</a></p>

www.testingcatalog.com

February 7, 2026 at 1:32 AM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

What's new? Copilot Pro+ and Copilot Enterprise subscribers run multiple AI agents in GitHub, GitHub Mobile and Visual Studio Code with per-agent settings and session logs;

GitHub lets Copilot subscribers run multiple AI coding agents

<p>GitHub has introduced a major update to Agent HQ, allowing Copilot Pro+ and Copilot Enterprise subscribers to run multiple coding agents, such as GitHub Copilot, Claude by Anthropic, and OpenAI Codex, directly inside GitHub, GitHub Mobile, and Visual Studio Code. This update is now publicly available in preview for eligible users, with plans to expand access to more Copilot subscription tiers soon. The feature is tailored for software developers and enterprise teams aiming to streamline their coding, review, and collaboration workflows without shifting between tools.</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">🙌 You can now use <a href="https://twitter.com/claudeai?ref_src=twsrc%5Etfw">@claudeai</a> and <a href="https://twitter.com/OpenAI?ref_src=twsrc%5Etfw">@OpenAI</a>’s Codex in GitHub and <a href="https://twitter.com/code?ref_src=twsrc%5Etfw">@code</a> with your GitHub Copilot Pro+ or Copilot Enterprise subscription. <br /><br />Define your intent, pick an agent, and they’ll get to work clearing backlogs and bottlenecks, all within your existing workflow.… <a href="https://t.co/6o13VDaPVw">pic.twitter.com/6o13VDaPVw</a></p>— GitHub (@github) <a href="https://twitter.com/github/status/2019093909981257849?ref_src=twsrc%5Etfw">February 4, 2026</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></figure><p>Agent HQ supports asynchronous and parallel use of different AI agents, letting users select the most suitable agent for specific coding, review, or analysis tasks. Agents can be enabled individually via user settings, and all session activity is logged and reviewable within the same repository workflow. This ensures traceability and accountability, with generated artifacts like comments, drafts, and suggested code changes integrated directly into collaboration threads. The platform enables teams to:</p><ol><li>Compare agent output</li><li>Assign agents to issues and pull requests</li><li>Maintain an organized review process with enterprise-grade audit logging and access management</li></ol><p>Additional administrative controls allow organizations to set usage policies and monitor metrics through a dedicated dashboard.</p><p>GitHub, as the company behind this release, has a longstanding focus on developer productivity and secure, collaborative coding. By embedding third-party AI agents into its core platforms, GitHub is furthering its mission to centralize code creation and review. The company is working with partners such as Anthropic and OpenAI, with future plans to add agents from Google, Cognition, and xAI, expanding developer choices within GitHub’s trusted environment.</p><p><a href="https://github.blog/news-insights/company-news/pick-your-agent-use-claude-and-codex-on-agent-hq/?utm_source=brand-twitter-blog-3p-agents-amp&utm_medium=social&utm_campaign=agent-3p-platform-feb-2026https://github.blog/news-insights/company-news/pick-your-agent-use-claude-and-code[…]s-amp&utm_medium=social&utm_campaign=agent-3p-platform-feb-2026>">Source</a></p>

www.testingcatalog.com

February 6, 2026 at 9:04 AM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

OpenAI introduces GPT-5.3-Codex with faster performance, expanded agent functions for developers, and strong coding benchmark results across multiple tasks.

OpenAI launches GPT-5.3-Codex for software tasks on paid plans

<p>OpenAI has rolled out GPT-5.3-Codex, positioning Codex as more than a code-writing agent and closer to a general computer-use agent for developers and other professionals. OpenAI says early versions helped debug its own training, manage deployment, and diagnose evaluations, effectively accelerating the model’s path to release. The company is highlighting benchmark gains across coding and agentic tasks, including a new high on SWE-Bench Pro (56.8%) and Terminal-Bench 2.0 (77.3%), plus stronger OSWorld-Verified results (64.7%), alongside fewer tokens used than prior models.</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">BREAKING 🚨: GPT-5.3-CODEX IS ROLLING OUT ON CODEX CLI AND DESKTOP APP! <br /><br />COMPETITION AT SCALE 🔥 <a href="https://t.co/MKss047eBo">pic.twitter.com/MKss047eBo</a></p>— TestingCatalog News 🗞 (@testingcatalog) <a href="https://twitter.com/testingcatalog/status/2019473740040429590?ref_src=twsrc%5Etfw">February 5, 2026</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></figure><p>GPT-5.3-Codex is aimed at end-to-end software work, not just code generation: debugging, deploying, monitoring, writing PRDs, tests, metrics, copy edits, and research tasks. OpenAI also claims it is more reliable on underspecified web requests, producing fuller default sites and UI elements without extra prompting. Codex is now described as more “interactive” inside the Codex app, with more frequent progress updates and optional in-progress steering via a follow-up setting.</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">BREAKING 🚨: GPT‑5.3‑CODEX WAS USED TO SUPPORT CREATING ITSELF, ACCORDING TO OPENAI'S BLOG!<br /><br />It achieves SOTA score of 57% at SWE Bench Pro and 76% on TerminalBench. <br /><br />"With GPT‑5.3-Codex, Codex goes from an agent that can write and review code to an agent that can do nearly… <a href="https://t.co/BWqjYi6Y5t">https://t.co/BWqjYi6Y5t</a> <a href="https://t.co/Tlz14JmzQG">pic.twitter.com/Tlz14JmzQG</a></p>— TestingCatalog News 🗞 (@testingcatalog) <a href="https://twitter.com/testingcatalog/status/2019477222164431124?ref_src=twsrc%5Etfw">February 5, 2026</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></figure><p>Availability is tied to paid ChatGPT plans wherever Codex runs, including the app, CLI, IDE extension, and web, with API access described as coming later. OpenAI says Codex runs 25% faster for users due to inference and infrastructure changes, and that GPT-5.3-Codex was trained and served on NVIDIA GB200 NVL72 systems. On security, OpenAI classifies it as “High capability” for cybersecurity tasks under its Preparedness Framework, is launching a Trusted Access for Cyber pilot, expanding its Aardvark security agent beta, and committing $10M in API credits to support defensive work.</p><p><a href="https://openai.com/index/introducing-gpt-5-3-codex/">Source</a></p>

www.testingcatalog.com

February 6, 2026 at 8:32 AM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

What's new? Anthropic launches Claude Opus 4.6 for finance pros with beta Cowork for Mac, PowerPoint for Max, Team, Enterprise and Excel updates for paid subscribers;

Anthropic launches Claude Opus 4.6 with tools for finance

<p>Anthropic has introduced Claude Opus 4.6, targeting finance professionals who rely on precise analysis and complex deliverables. The model is now available to all paid Claude plan subscribers and includes public beta releases of Cowork and Claude in PowerPoint, in addition to updates for Claude in Excel. Cowork is available as a research preview for Mac users, with Windows support planned, and Claude in PowerPoint is accessible for Max, Team, and Enterprise users.</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">Introducing Claude Opus 4.6. Our smartest model got an upgrade. <br /><br />Opus 4.6 plans more carefully, sustains agentic tasks for longer, operates reliably in massive codebases, and catches its own mistakes.<br /><br />It’s also our first Opus-class model with 1M token context in beta. <a href="https://t.co/L1iQyRgT9x">pic.twitter.com/L1iQyRgT9x</a></p>— Claude (@claudeai) <a href="https://twitter.com/claudeai/status/2019467372609040752?ref_src=twsrc%5Etfw">February 5, 2026</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></figure><p>Claude Opus 4.6 demonstrates improvements over its predecessor, Opus 4.5, with more than a 23 percentage point gain in Anthropic’s internal Real-World Finance evaluation.</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">Claude subscribers can claim $50 worth of credits for TESTING Claude Opus 4.6!<br /><br />Claim it 👀 <br />h/t <a href="https://twitter.com/M1Astra?ref_src=twsrc%5Etfw">@M1Astra</a> <a href="https://t.co/RZoRLqYyF8">https://t.co/RZoRLqYyF8</a> <a href="https://t.co/UzSLLtpZgD">pic.twitter.com/UzSLLtpZgD</a></p>— TestingCatalog News 🗞 (@testingcatalog) <a href="https://twitter.com/testingcatalog/status/2019503073819849152?ref_src=twsrc%5Etfw">February 5, 2026</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></figure><p>The model also leads in external benchmarks such as Finance Agent and TaxEval by Vals AI. It excels at handling multi-step financial tasks, executing code, and generating complex documents. Claude Opus 4.6 in Excel now supports features like:</p><ol><li>Pivot table editing</li><li>Chart modifications</li><li>Data validation</li><li>Finance-grade formatting</li></ol><p>Meanwhile, Cowork enables direct file manipulation and workflow automation with plugin support.</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="et" dir="ltr">Claude Opus 4.6 (120K Thinking) on ARC-AGI Semi-Private Eval<br /><br />Max Effort:<br />- ARC-AGI-1: 93.0%, $1.88/task<br />- ARC-AGI-2: 68.8% $3.64/task<br /><br />New ARC-AGI SOTA model from <a href="https://twitter.com/AnthropicAI?ref_src=twsrc%5Etfw">@AnthropicAI</a> <a href="https://t.co/rfjhpp2B6G">pic.twitter.com/rfjhpp2B6G</a></p>— ARC Prize (@arcprize) <a href="https://twitter.com/arcprize/status/2019483337400938580?ref_src=twsrc%5Etfw">February 5, 2026</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></figure><p>Early feedback from industry CTOs and investment professionals points to substantial gains in speed and accuracy, with some expressing that tasks previously demanding days of work now take minutes. Anthropic, the developer behind Claude, continues to focus on AI-driven tools for the finance sector, with a mission to help organizations automate and accelerate analytical work while maintaining oversight and quality standards.</p><p><a href="https://claude.com/blog/opus-4-6-finance">Source</a></p>

www.testingcatalog.com

February 6, 2026 at 8:29 AM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

OpenAI launches Frontier, a new enterprise platform for deploying AI coworkers across real business systems, now available to select customers with partners onboard.

OpenAI debuts Frontier to deploy AI agents for enterprise users

www.testingcatalog.com

February 6, 2026 at 8:29 AM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

Anthropic is testing upgraded voice mode for the desktop app and a new knowledge base feature for Claude.

Anthropic readies upgraded Claude voice mode for desktop

www.testingcatalog.com

February 5, 2026 at 10:45 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

What's new? Perplexity launches DRACO benchmark for AI research in law, medicine, finance and academia; it uses LLM as judge and is public;

Perplexity launches Advanced Deep Research for Max users

<p>Perplexity has introduced the Deep Research Accuracy, Completeness, and Objectivity (DRACO) Benchmark, positioning it as an open standard for evaluating the capabilities of AI agents in handling complex research tasks. This benchmark is now available to the public, allowing AI developers, researchers, and organizations worldwide to assess their own systems. DRACO is built to reflect authentic research scenarios, drawing its tasks from millions of real production queries submitted to Perplexity Deep Research. It covers ten diverse domains, including Law, Medicine, Finance, and Academic research, and is accompanied by detailed evaluation rubrics refined through expert review.</p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">We've upgraded Deep Research in Perplexity.<br /><br />Perplexity Deep Research achieves state-of-the-art performance on leading external benchmarks, outperforming other deep research tools on accuracy and reliability.<br /><br />Available now for Max users. Rolling out to Pro in the coming days. <a href="https://t.co/8RAlewuWa3">pic.twitter.com/8RAlewuWa3</a></p>— Perplexity (@perplexity_ai) <a href="https://twitter.com/perplexity_ai/status/2019126571521761450?ref_src=twsrc%5Etfw">February 4, 2026</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></figure><p>The DRACO Benchmark evaluates AI agents on four key dimensions:</p><ol><li>Factual accuracy</li><li>Analytical breadth and depth</li><li>Presentation quality</li><li>Citation of sources</li></ol><figure class="kg-card kg-image-card kg-width-wide"><img src="https://www.testingcatalog.com/content/images/2026/02/HAVg-VObMAABkLZ.png" class="kg-image" alt="Perplexity" loading="lazy" width="1186" height="746" srcset="https://www.testingcatalog.com/content/images/size/w600/2026/02/HAVg-VObMAABkLZ.png 600w, https://www.testingcatalog.com/content/images/size/w1000/2026/02/HAVg-VObMAABkLZ.png 1000w, https://www.testingcatalog.com/content/images/2026/02/HAVg-VObMAABkLZ.png 1186w" /></figure><p>Notably, the evaluation process uses an LLM-as-judge protocol, ensuring responses are fact-checked against real data and reducing subjectivity. Compared to previous benchmarks, DRACO focuses on genuine user needs rather than synthetic or academic tasks and is model-agnostic, so it can assess any AI system with research capabilities. Early results show Perplexity Deep Research leads in accuracy and speed, outperforming competitors in challenging domains such as legal and personalized queries.</p><p><a href="https://www.testingcatalog.com/tag/perplexity/">Perplexity</a>, the company behind DRACO, is recognized for its AI-driven search and research tools. By open-sourcing DRACO, Perplexity aims to raise the standard for deep research agents and encourage broader adoption of rigorous, production-grounded evaluation methods across the AI industry.</p><p><a href="https://research.perplexity.ai/articles/evaluating-deep-research-performance-in-the-wild-with-the-draco-benchmark">Source</a></p>

www.testingcatalog.com

February 4, 2026 at 10:52 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

What do we know so far? Perplexity may soon introduce Model Council for Max users, allowing multi-model system advancements and hinting at a new ASI mode.

Perplexity working on Model Consil, combining 3 AI models

<p>Perplexity recently announced <a href="https://research.perplexity.ai/articles/evaluating-deep-research-performance-in-the-wild-with-the-draco-benchmark?utm_source=X&utm_medium=thread">Advanced Deep Research</a>, but it looks like the team is also working on more tools behind the scenes. One of them is called <strong>Model Council</strong>. It’s labeled as “Max,” which suggests it will be limited to Max subscribers. The description says it lets you compare responses from multiple AI models. When enabled, the model selector appears to support choosing three models at once, specifically Gemini 3 Pro, GPT-5.2, and Claude Opus 4.5 with reasoning. Since these are currently among the strongest options, the feature hints at a system where Perplexity can put multiple models on the same task and potentially combine their outputs.</p><p>That approach matters because multi-model systems have already shown strong results in benchmarks like ARC-AGI. In past submissions, including Poetiq’s entry last year and another one submitted this year, multi-model setups have outperformed many single-model runs. </p><figure class="kg-card kg-embed-card"><blockquote class="twitter-tweet"><p lang="en" dir="ltr">A new 72% acheivement submission for ARC-AGI-2. So far, it is the second multi-model system that outperformed single-model solutions. <br /><br />"It runs the same task through GPT-5.2, Gemini-3, and Claude Opus 4.5 in parallel." <br /><br />We need new benchmarks 👀 <a href="https://t.co/Tkmdyg8m4v">https://t.co/Tkmdyg8m4v</a> <a href="https://t.co/SoJnnjh6mL">pic.twitter.com/SoJnnjh6mL</a></p>— TestingCatalog News 🗞 (@testingcatalog) <a href="https://twitter.com/testingcatalog/status/2018751334179107052?ref_src=twsrc%5Etfw">February 3, 2026</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script></figure><p>If Perplexity turns this kind of orchestration into a product feature, it could stand out from competitors who tend to prioritize their own models by default. The open question is how well it performs in real usage.</p><figure class="kg-card kg-image-card kg-width-wide"><img src="https://www.testingcatalog.com/content/images/2026/02/Perplexity-02-04-2026_10_45_PM.jpg" class="kg-image" alt="Perplexity" loading="lazy" width="2000" height="1036" srcset="https://www.testingcatalog.com/content/images/size/w600/2026/02/Perplexity-02-04-2026_10_45_PM.jpg 600w, https://www.testingcatalog.com/content/images/size/w1000/2026/02/Perplexity-02-04-2026_10_45_PM.jpg 1000w, https://www.testingcatalog.com/content/images/size/w1600/2026/02/Perplexity-02-04-2026_10_45_PM.jpg 1600w, https://www.testingcatalog.com/content/images/size/w2400/2026/02/Perplexity-02-04-2026_10_45_PM.jpg 2400w" /></figure><p>There’s also another mode called <strong>Gamma</strong>. It doesn’t include any description, only an AI-style icon. What stands out is that <strong>it’s referenced in the code as ASI</strong>, short for “artificial superintelligence.” Beyond the names Gamma and ASI, there’s nothing else to infer yet, so its purpose is unclear. Still, if this aligns with <a href="https://www.testingcatalog.com/tag/perplexity/">Perplexity’s</a> direction, the company may be setting up a higher-tier mode aimed at more ambitious capabilities.</p>

www.testingcatalog.com

February 4, 2026 at 10:31 PM

TestingCatalog

@index.www.testingcatalog.com.ap.brid.gy

What's new? Xcode 26.3 now features built-in claude agent sdk for long-running coding tasks and swiftui previews; release candidate available to apple developer program.