A list of research papers and other related resources on Vision-Language-Action/Navigation (VLA/VLN) models for UAVs.
Contributions are welcome!
- APEX: A Decoupled Memory-based Explorer for Asynchronous Aerial Object Goal Navigation (CVPR 2026)[paper][code] (Note: Dual system; history info)
- History-Enhanced Two-Stage Transformer for Aerial Vision-and-Language Navigation (AAAI 2026)[paper][code] (Note: Two-stage: first locate the rough bearing, then search for fine-grained details; historical grid map)
- IndoorUAV: Benchmarking Vision-Language UAV Navigation in Continuous Indoor Environments (AAAI 2026)[paper][code] (Note: Datasets: IndoorUAV-VLN (long-horizon navigation tasks) and IndoorUAV-VLA (short-horizon planning tasks); IndoorUAV-Agent: first uses GPT-4o to segment the raw instruction, then a π0-based VLA for flight control, with visual feedback assisting the next round of reasoning)
- AutoFly: Vision-Language-Action Model for UAV Autonomous Navigation in the Wild (ICLR 2026)[paper][code] (Note: Dataset, pseudo-depth encoder)
- AirHunt: Bridging VLM Semantics and Continuous Planning for Efficient Aerial Object Navigation (arXiv 2026.1)[paper][[code]] (Note: Dual system, memory)
- Fly0: Decoupling Semantic Grounding from Geometric Planning for Zero-Shot Aerial Navigation (arXiv 2026.2)[paper][code] (Note: Dual system, similar to SPF)
- USS-Nav: Unified Spatio-Semantic Scene Graph for Lightweight UAV Zero-Shot Object Navigation (arXiv 2026.2)[paper][[code]] (Note: polyhedral 3D scene graph; semantics selects the region, a classical algorithm plans the path; runs efficiently on Jetson Orin NX)
- AirNav: A Large-Scale Real-World UAV Vision-and-Language Navigation Dataset with Natural and Diverse Instructions (arXiv 2026.1)[paper][code] (Note: Dataset, AirVLN-R1, Tello)
- AION: Aerial Indoor Object-Goal Navigation Using Dual-Policy Reinforcement Learning (arXiv 2026.1)[paper][[code]] (Note: Detection & depth, exploration & exploitation, explore-then-home-in)
- Aerial World Model for Long-horizon Visual Generation and Navigation in 3D Space (arXiv 2026.1)[paper][[code]] (Note: Imagining before moving)
- NavDreamer: Video Models as Zero-Shot 3D Navigators (arXiv 2026.2)[paper][code] (Note: language instruction → video generation → waypoint extraction → trajectory planning → real flight)
- EzReal: Enhancing Zero-Shot Outdoor Robot Navigation toward Distant Targets under Varying Visibility (ICRA 2026)[paper][code] (Note: Robots, object navigation; spot the silhouette, infer the direction, remember it, head toward it)
- [Review] UAVs Meet LLMs: Overviews and Perspectives Toward Agentic Low-Altitude Mobility (Information Fusion 2025.3)[paper][code]
- See, Point, Fly: A Learning-Free VLM Framework for Universal Unmanned Aerial Navigation (CoRL 2025)[paper][code] (Note: Dual system, SPF)
- VLA-AN: An Efficient and Onboard Vision-Language-Action Framework for Aerial Navigation in Complex Environments (arXiv 2025.12)[paper][[code]] (Note: End-to-end, three-stage training strategy, onboard deployment)
- NavRL: Learning Safe Flight in Dynamic Environments (IEEE Robotics and Automation Letters 2025.4)[paper][code] (Note: Deep RL, uses depth info)
- ASMA: An Adaptive Safety Margin Algorithm for Vision-Language Drone Navigation via Scene-Aware Control Barrier Functions (IEEE Robotics and Automation Letters 2025.9)[paper][code] (Note: VLN + MPC, uses depth info)
- LongFly: Long-Horizon UAV Vision-and-Language Navigation with Spatiotemporal Context Integration (arXiv 2025.12)[paper][[code]] (Note: Uses history info)
- OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation (arXiv 2025.7)[paper][code] (Note: Dataset)
- TypeFly: Low-Latency Drone Planning With Large Language Models (IEEE Transactions on Mobile Computing 2025.9)[paper][code]
- Towards Realistic UAV Vision-Language Navigation: Platform, Benchmark, and Methodology (OpenUAV) (ICLR 2025)[paper][code]
- MonoSpheres: Large-Scale Monocular SLAM-Based UAV Exploration through Perception-Coupled Mapping and Planning (arXiv 2025.11)[paper][code] (Note: couples monocular SLAM with perception-aware mapping and planning)
- OpenVLN: Open-world Aerial Vision-Language Navigation (arXiv 2025.11)[paper][[code]] (Note: uses reinforcement learning and a value model to tackle the dual challenges of data scarcity and long-horizon planning)
- UAV-VLRR: Vision-Language Informed NMPC for Rapid Response in UAV Search and Rescue (arXiv 2025.3)[paper][code] (Note: VLM + NMPC)
- UAV-Flow Colosseo: A Real-World Benchmark for Flying-on-a-Word UAV Imitation Learning (arXiv 2025.5)[paper][code]
- UAV-ON: A Benchmark for Open-World Object Goal Navigation with Aerial Agents (ACM MM Dataset Track 2025)[paper][code]
- AeroDuo: Aerial Duo for UAV-based Vision and Language Navigation (ACM MM 2025)[paper][[code]]
- Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space (ACM MM 2025)[paper][code] (Note: Dataset)
- CityNav: A Large-Scale Dataset for Real-World Aerial Navigation (ICCV 2025)[paper][code]
- CityNavAgent: Aerial Vision-and-Language Navigation with Hierarchical Semantic Planning and Global Memory (ACL 2025)[paper][code]
- VLM-Nav: Mapless UAV Navigation Using Monocular Vision Driven by Vision-Language Model (SSRN)[paper][code]
- Learning Fine-Grained Alignment for Aerial Vision-Dialog Navigation (AAAI 2025)[paper][code]
- UAV-VLA: Vision-Language-Action System for Large Scale Aerial Mission Generation (HRI 2025)[paper][code]
- General-Purpose Aerial Intelligent Agents Empowered by Large Language Models (arXiv 2025.5)[paper][[code]]
- RAVEN: Resilient Aerial Navigation via Open-Set Semantic Memory and Behavior Adaptation (arXiv 2025.9; Best Paper Finalist at the IROS 2025 Active Perception Workshop)[paper][project]
- [Review] Large Language Models for UAVs: Current State and Pathways to the Future (IEEE Open Journal of Vehicular Technology 2024.8)[paper][[code]]
- AeroVerse: UAV-Agent Benchmark Suite for Simulating, Pre-training, Finetuning, and Evaluating Aerospace Embodied World Models (arXiv 2024.8)[paper][[code]]
- TPML: Task Planning for Multi-UAV System with Large Language Models (ICCA 2024)[paper][code]
- EAI-SIM: An Open-Source Embodied AI Simulation Framework with Large Language Models (ICCA 2024)[paper][code]
- Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning (STMR) (Submitted to ICRA 2025)[paper][[code]]
- Visual Agents as Fast and Slow Thinkers (ICLR 2025)[paper][code]
- Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces (arXiv 2025)[paper][[code]]
- Helix: A "System 1, System 2" VLA for Whole Upper Body Control (figure.ai)[link]
- DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models (CoRL 2024)[paper][project]
- Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models (Physical Intelligence (π)) (ICML 2025)[paper][blog]
- HiRT: Enhancing Robotic Control with Hierarchical Robot Transformers (CoRL 2024)[paper][[code]]
- GR00T N1: An Open Foundation Model for Generalist Humanoid Robots (arXiv 2025.3)[paper][code][tech]
- GR00T N1.5: An Improved Open Foundation Model for Generalist Humanoid Robots [tech][code][blog]