🚀 Grok Voice Think Fast 1.0 (@xAI) lands on the Pareto frontier on EVA-Bench: no system in the eval beats it on accuracy without sacrificing experience, or vice versa. 📊 Leaderboard: servicenow.github.io/eva/#re… @elonmusk #VoiceAgents #ServiceNowResearch #EVABench #GrokVoice #xAI
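The "Pareto frontier" claim in the post above can be illustrated with a minimal sketch. It assumes each system is scored on two axes where higher is better (accuracy and experience); the system names and scores are made up for illustration and are not EVA-Bench results, and `pareto_front` is a hypothetical helper, not part of any EVA-Bench tooling.

```python
def pareto_front(scores):
    """Return the names of systems that no other system dominates.

    `scores` maps a system name to an (accuracy, experience) pair,
    with higher values being better on both axes (an assumption).
    """
    front = []
    for name, (acc, exp) in scores.items():
        # A system is dominated if another one is at least as good on
        # both axes and strictly better on at least one of them.
        dominated = any(
            o_acc >= acc and o_exp >= exp and (o_acc > acc or o_exp > exp)
            for other, (o_acc, o_exp) in scores.items()
            if other != name
        )
        if not dominated:
            front.append(name)
    return front


# Made-up scores, purely illustrative.
example = {"A": (0.90, 0.60), "B": (0.70, 0.80), "C": (0.65, 0.55)}
print(pareto_front(example))
```

Here "C" drops out because "A" is better on both axes, while "A" and "B" each win on one axis and so both stay on the frontier.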

⭐ EVA-Bench Data 2.0: 3 Domains, 121 Tools, 213 Scenarios. We just published an article detailing our major expansion of the data behind EVA-Bench. 🗂️ Data: huggingface.co/datasets/Serv… 📄 Article: huggingface.co/blog/ServiceN… #VoiceAgents #OpenSource #Data #AIResearch #ServiceNowResearch
13 Oct 2025
Are Firewalls All You Need, or Stronger Benchmarks? Benchmarking is critical for understanding and comparing the security of tool-calling agents. As attacks evolve and defenses adapt, researchers need consistent, realistic, and reproducible evaluation frameworks to identify true progress and avoid misleading conclusions. Several recent benchmarks, such as AgentDojo, Agent Security Bench, and InjecAgent, aim to simulate real-world attack scenarios. However, our analysis further reveals that many of these benchmarks do not model real-world situations appropriately and sometimes employ skewed metrics to gauge performance. In such cases, even weak defenses may seem deceptively effective. We highlight these limitations and fix them through our proposed standardized benchmarking best-practices. Source: arxiv.org/pdf/2510.05244 @rishika2110, @KevinKasa98, @AbhayPuri98, @GabrielHuang9, @irinarish, Graham W. Taylor, @DjDvij, @alex_lacoste_ - @ServiceNowRSRCH, @Mila_Quebec, @UMontreal, @VectorInst, @uofg #PromptInjection #LLMSecurity #AIAgents #AgentSecurity #AIEvaluation #Guardrails #CyberSecurity #AISafety #MilaQuebec #ServiceNowResearch #VectorInstitute #UniversityOfGuelph
📢 Don't miss it tomorrow at #COLM2025! Our researcher Arjun Ashok will present 🧩 "Context is Key: A Benchmark for Forecasting with Essential Textual Information" ⏳ 🕓 4:45 PM at the XTempLLMs 2025 workshop 🔗 xtempllms.github.io/2025/pro… #AIresearch #ServiceNowResearch #XTempLLMs #LLMs #Forecasting #TimeSeries #MachineLearning #GenerativeAI

🎉 It's CoLM week! The Conference on Language Modeling (CoLM 2025) kicks off tomorrow in Montréal 🇨🇦🍁 Proud that ServiceNow AI Research is a main sponsor, and that our team will present 5 papers on: 📊 Multimodal reasoning 🔄 Unified AR & diffusion models 🔍 Dense retrieval 🛡️ AI safety 📈 Efficient adaptation See you there! 🚀 #COLM2025 #AIresearch #ServiceNowResearch #LLMs #GenerativeAI #MontrealAI
Exciting update from our 💫StarFlow project! 🌐 servicenow.github.io/StarFlo… When we first introduced StarFlow, we showed how Vision–Language Models (VLMs) can transform sketches and diagrams into structured workflows for automation. Today, we're taking it a step further: we're open-sourcing the models, dataset, and code to the community! 🎉
🔹 Fine-Tuned Models
• Llama-3.2-11B-Vision-Instruct-StarFlow (huggingface.co/ServiceNow/Ll…)
• Pixtral-12B-2409-StarFlow (huggingface.co/ServiceNow/Pi…)
• Qwen2.5-VL-7B-Instruct-StarFlow (huggingface.co/ServiceNow/Qw…)
🗂️ Large & Diverse Dataset
• BigDocs-Sketch2Flow (huggingface.co/datasets/Serv…)
💻 Training & Evaluation Code
• Includes our custom metrics & benchmarks
• Available on github.com/ServiceNow/StarFl…
We hope these resources empower researchers and practitioners to push the boundaries of vision-language reasoning and enterprise automation. #AI #OpenSource #VisionLanguageModels #SketchToFlow #WorkflowAutomation #GenerativeAI @ServiceNowResearch
🚀 New paper from our team at @ServiceNowRSRCH! 💫StarFlow: Generating Structured Workflow Outputs From Sketch Images. We use VLMs to turn hand-drawn sketches and diagrams into executable workflows. 🖍️→⚙️ 🔗 arxiv.org/abs/2503.21889 tinyurl.com/3utdbn97 #Sketch2Flow #AI #VLM
11 Jul 2025
wow, you were the face of servicenowresearch for me can't wait to see where you end up, congrats:)
📣📣📣 We just dropped Test Split 3️⃣ of RepLiQA, our Q&A dataset built to really test LLMs on unseen, made-up content. 🚀 Great for RAG, context reasoning & in-context learning 🚀 huggingface.co/datasets/Serv… #ServiceNowResearch
Very excited to see this work coming out from #ServiceNowResearch. Can't wait to try the trained VLM in #AgentLab.
🎉 Excited to introduce BigDocs! An open, transparent multimodal dataset designed for: 📄 Documents 🌐 Web content 🖥️ GUI understanding 👨‍💻 Code generation from images. We're also launching BigDocs-Bench, featuring 10 tasks to test models on: ➡️ Document, Web, GUI Visual reasoning ➡️ Converting images into JSON, Markdown, LaTeX, SVG, and more! 📜 Paper: arxiv.org/pdf/2412.04626 huggingface.co/papers/2412.0… 🌐 Website: bigdocs.github.io/
🚨 Preprint Alert! 🚨 It's 12 hours before your conference deadline. Tic, toc. ⏰ You're obviously last minute and need to write code for some fancy plots. 📊 You counted on your coding assistant to do the heavy lifting, but it's not version-aware. 🤖❌ You keep hitting relentless matplotlib plot errors. 🐛 Tic, toc. Panic sets in. 😱

🚀 Introducing GitChameleon 🦎 Our new benchmark tests large language models (LLMs) on their ability to generate version-specific code. We curated 116 Python code completion problems, each tied to specific library versions, complete with executable unit tests.

Why Does Version Awareness Matter? LLMs are great at generating code, but they often fail when library versions change. This can lead to non-functional code, wasting precious time, especially when deadlines loom! 🕒

The Challenge: Software libraries evolve rapidly. Matplotlib, NumPy, PyTorch, you name it. If your code assistant isn't aware of version-specific changes, you could be in for a world of debugging pain. 😩

What GitChameleon Brings to the Table:
* Version-Specific Problems: Focuses on real-world issues like deprecated functions and API updates.
* Execution-Based Evaluation: Goes beyond static code analysis to test actual functionality.
* Popular Libraries Covered: Matplotlib, NumPy, PyTorch, Pandas, and more.

Key Findings: We tested state-of-the-art LLMs, including GPT-4o, Gemini, DeepSeekCoder v2, and others.
* Performance Was Underwhelming: GPT-4o achieved a pass@10 of only 39.9%.
* Error Feedback Helps Slightly: With error feedback, GPT-4o improved to 43.7%.
* Low Correlation with Other Benchmarks: The correlation of GitChameleon with representative code benchmarks was low. The Spearman correlation coefficients with HumanEval, EvalPlus, and BigCodeBench-Hard split were 0.37, 0.50, and 0.36, respectively. This highlights the unique challenges in version-specific code generation.

Types of Version Changes Tested:
* Function Name Changes
* Argument/Attribute Changes
* Semantic/Behavioral Changes (avg pass@10: ~9.3% 😱).
* New Features/Dependencies

Paper: huggingface.co/papers/2411.0… Code: github.com/NizarIslah/GitCha… Thanks to first authors @nizar_islah and Justine G, and to @irinarish, @NeuralEnsemble, @terryyuezhuo @ServiceNowResearch @MILA (yes, I did pay 3.75$ to write a long post 😛)
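For readers unfamiliar with the pass@10 numbers quoted in the post above, here is a minimal sketch of how a pass@k score is commonly computed. It assumes the widely used unbiased estimator (draw n samples per problem, count the c that pass the unit tests); the post does not say which estimator GitChameleon uses, so treat this as an illustration, and `pass_at_k` as a hypothetical helper name.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the chance that at least one of k samples passes.

    n: samples generated for the problem
    c: samples among them that pass the unit tests
    k: sample budget being scored (k <= n)
    """
    if k > n:
        raise ValueError("k must not exceed n")
    # 1 minus the probability that a random size-k subset of the n
    # samples contains no passing sample. math.comb returns 0 when
    # fewer than k failing samples exist, giving a score of 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative call: 20 samples, 1 passing, scored at k=10.
print(pass_at_k(20, 1, 10))
```

A benchmark-level score would then be the mean of this value over all problems.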
10 Dec 2023
Excited to present our #EMNLP2023 paper, PromptMix: Class Boundary Augmentation Method for Large Language Model Distillation! I'm presenting it in the East Foyer. Come say hi! paper: arxiv.org/pdf/2310.14192.pdf code: github.com/ServiceNow/Prompt… #UWCheritonCS #ServiceNowResearch