Autonomous operations: Why communications providers can’t afford to wait (Reader Forum)

Amid heightened network expectations, communications providers are faced with increasingly complex environments

In today’s hyperconnected world, disruptions or disconnections of any form are more than an inconvenience. They have evolved into reputational and financial liabilities that no organization wants to face. Whether it’s a streaming blackout during the Super Bowl or a mobile network that falters while you are trying to catch the latest headlines, consumer tolerance for downtime has fallen to zero. In a July 2025 report, 51% of survey respondents noted that internet outages and downtime can lead to losses of a million dollars or more per month.

Behind the scenes of these heightened expectations, communications and media infrastructure (CMI) providers are faced with increasingly complex environments that stretch across hybrid cloud, edge data centers, and expansive global networks. This depth of interconnectivity is creating fragile ecosystems where the margin for error is razor thin.

The rising cost of downtime

Today’s outages carry a steep price tag, with a growing number of organizations reporting losses in the millions for every hour that systems remain unavailable. A 2024 study by the Uptime Institute showed that resilience remains elusive, even in our digital era. In the report, nearly 70% of operators reported experiencing some form of outage in the past three years, with more than half acknowledging a “serious” or “severe” impact. Beyond these direct costs, operators face prolonged losses in the form of customer churn, regulatory scrutiny, and reputational harm that can take years to repair.

For communications providers, these growing risks converge in an environment where subscribers expect flawless delivery, 24 hours a day, 365 days a year. Antiquated, reactive operational models are no longer sufficient.

The fault in reactivity

Traditionally, network operations teams responded to issues as they were identified. An outage would trigger an alarm or alert, technicians would isolate the fault, and remediation would follow. That methodology does not scale to today’s user base or demand. At any given point, there could be millions of concurrent customer sessions spanning multiple geographies and platforms. The traditional three-step reactive approach cannot keep pace.

The 2025 Verizon cellular service outage exemplifies this. While no detailed technical report was ever released, a software-based issue caused devices across major U.S. cities to suddenly drop into “SOS” mode, cutting off users’ wireless service. Sites that monitor technology downtime were flooded with more than 20,000 outage reports, and social media was filled with complaints. This layered incident sent ripples across the entire digital ecosystem, from end-devices to regional data centers, and highlights how fragile interconnected systems have become.

In today’s environments, by the time alarms reach human operators, the reputational damage is either already done or still unfolding. Customers who have encountered service degradation will search for answers, first on social media, then on websites that monitor live service outages, and then often back on social media. From there, posts can quickly go viral across several platforms, potentially catching the eyes of regulators who may begin to ask questions.

In all of this noise, teams searching for the root cause of the disruption can become overwhelmed and distracted by the continuous barrage of alerts from disparate systems. It’s becoming increasingly evident with every disruption: reactive models are built for yesterday’s networks. Tomorrow’s networks demand proactive and even autonomous capabilities.

The age of automation and the role of intelligent operations

One of the fastest-growing approaches in IT operations is the combination of automation and autonomy. These models harness advanced analytics and AI not only to detect potential failures before they occur but also to remediate them before they impact the end user. The approach mirrors what has long been standard in other industries such as aviation and power systems, which focus on self-monitoring, self-governing, and self-healing. These three concepts are table stakes for critical infrastructure.

For CMIs, autonomy can be observed across several dimensions:

  • Proactive detection: Rather than waiting for failures to trigger alarms, intelligent systems continuously monitor for anomalies in traffic, latency, or discrepancies in application behavior. This allows for the generation of early warnings and interventions before a customer notices disruption.
  • Noise reduction: AI-driven event correlation can consolidate related alerts and reduce false positives, allowing teams to focus on the events that truly matter. This reduction in noise lowers the cognitive load on operators and shortens mean time to resolution.
  • Automated remediation: Routine or predictable actions, such as restarting services or rerouting traffic, can be executed autonomously, cutting response times by up to 90%. Human experts remain in the loop but are freed to tackle complex or high-impact scenarios while automation handles the bulk.
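As a rough illustration of the proactive detection idea above, the following Python sketch flags latency samples that deviate sharply from a rolling baseline. It is a minimal statistical stand-in, not a description of any particular product: the function name `detect_anomalies`, the window size, and the threshold are all assumptions chosen for the example, and production systems would typically use richer models.

```python
from collections import deque
from statistics import mean, stdev


def detect_anomalies(samples, window=5, threshold=3.0):
    """Return indices of samples that deviate sharply from the recent baseline.

    Hypothetical helper: compares each sample with the mean and standard
    deviation of the previous `window` non-anomalous samples.
    """
    history = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(samples):
        if len(history) == window:
            mu = mean(history)
            sigma = stdev(history)
            # Skip scoring when the baseline is perfectly flat (sigma == 0).
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append(i)
                continue  # keep the outlier out of the baseline
        history.append(value)
    return anomalies


# Illustrative latency readings in milliseconds (made-up data).
latencies_ms = [20, 21, 19, 20, 22, 21, 20, 95, 21, 20]
print(detect_anomalies(latencies_ms))  # prints [7]
```

An early warning raised at index 7 could then feed an automated action such as rerouting traffic, before customers report a problem.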

The goal should be a shift from firefighting to foresight, with resilience woven deeply into the fabric of operations.

The hybrid-cloud challenge

The urgency to implement autonomous operations is being driven by the growing hybrid-cloud reality. In the age of global communications, providers rarely operate within the neat boundaries of a single data center or vendor. Instead, workloads are distributed across hyperscale cloud environments, private infrastructure, edge sites, and third-party services.

This distribution introduces numerous levels of complexity. For instance, a single customer session might touch multiple cloud providers, traverse public and private backbones, and depend on services outside of direct operator control. In this scenario, any of these threads could snap, creating a tear that impacts the whole fabric of users.

Visibility and insight into these events are valuable, but they are not enough. Providers must turn towards systems that can correlate signals across heterogeneous domains, learn from patterns, and take context-driven actions, often in milliseconds.
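To make cross-domain correlation concrete, here is a minimal Python sketch that groups alerts from different domains into a single incident when they reference the same resource within a short time window. It is a simple rule-based stand-in for the AI-driven correlation described above. The `Alert` fields, the `correlate` function, the 60-second window, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    timestamp: float  # seconds since an arbitrary start
    domain: str       # e.g. "edge", "cloud", "core" (illustrative labels)
    resource: str     # shared identifier assumed to exist across domains
    message: str


def correlate(alerts, window_s=60.0):
    """Group alerts on the same resource that arrive within window_s of each other."""
    incidents = []
    latest = {}  # resource -> most recent incident for that resource
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest.get(alert.resource)
        if current and alert.timestamp - current[-1].timestamp <= window_s:
            current.append(alert)
        else:
            current = [alert]
            incidents.append(current)
            latest[alert.resource] = current
    return incidents


# Made-up alerts spanning three domains.
alerts = [
    Alert(0, "edge", "link-42", "latency spike"),
    Alert(12, "cloud", "link-42", "API timeouts"),
    Alert(30, "core", "link-42", "packet loss"),
    Alert(45, "cloud", "db-7", "replica lag"),
    Alert(500, "edge", "link-42", "latency spike"),
]
incidents = correlate(alerts)
print(len(alerts), "alerts ->", len(incidents), "incidents")
```

In this example, five raw alerts collapse into three incidents, and the first incident shows that one fault on a shared link surfaced in the edge, cloud, and core domains at nearly the same time.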

Building trust in autonomous systems

While ensuring uptime is a priority, skepticism remains. Executives still question whether AI-driven autonomy can be trusted in mission-critical environments. Their concerns often revolve around whether actions can be explained to both customers and auditors, and around the likelihood and context of false triggers. There is also a growing focus on the risk of “black box” decisions that operators can’t audit.

To address these questions and reassure everyone from executives to operations teams, all groups must understand that the path forward is not as simple as flipping a switch. Rather, it is a process of adopting technology in meaningful layers.

  • The starting point: Organizations could start their autonomy journey by going live with assistive intelligence, meaning teams use AI to flag issues and recommend actions.
  • The next step: With assistive intelligence in hand, teams can be bolder and level up to semi-autonomous remediation, where automation resolves only low-risk scenarios.
  • The final leap: As confidence in the adopted technology improves, teams can push to extend autonomy to more critical domains.
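The layered adoption path above can be sketched as a simple policy gate. The Python example below is a hypothetical illustration, not a reference implementation: the `AutonomyLevel` and `Risk` enums, the `decide` function, and the two-level risk scale are assumptions introduced here. Each decision carries a reason string so that it can be logged and audited.

```python
from enum import IntEnum


class AutonomyLevel(IntEnum):
    ASSISTIVE = 1        # AI flags issues and recommends actions
    SEMI_AUTONOMOUS = 2  # automation resolves only low-risk scenarios
    EXTENDED = 3         # autonomy extended to more critical domains


class Risk(IntEnum):
    LOW = 1
    HIGH = 2


def decide(level, risk):
    """Return (action, reason), where action is 'recommend' or 'execute'."""
    if level == AutonomyLevel.ASSISTIVE:
        return ("recommend", "assistive mode: a human approves every action")
    if level == AutonomyLevel.SEMI_AUTONOMOUS and risk == Risk.HIGH:
        return ("recommend", "high-risk action requires human approval")
    return ("execute", f"{risk.name} risk permitted at level {level.name}")


for level in AutonomyLevel:
    for risk in Risk:
        action, reason = decide(level, risk)
        print(f"{level.name:16} {risk.name:5} -> {action}: {reason}")
```

Keeping the gate in one small function means the organization's current autonomy level is a single, reviewable setting rather than behavior scattered across automation scripts.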

Equally important in the pursuit of automation and autonomy is governance. Clear guardrails, human-in-the-loop design, and explainability must be central to every autonomous operation. These elements preserve accountability while unlocking the benefits of speed and scale.

The business case for autonomy

The case for the adoption of autonomous operations isn’t just technical. It’s part of a growing conversation about their existential importance to an organization’s survival. In today’s fiercely competitive market, subscriber loyalty can be earned or lost within minutes. Investors and regulators are increasingly scrutinizing resilience and holding providers accountable for outages.

By gradually moving towards autonomy, operators can mitigate catastrophic failures and put themselves in a better position to deliver superior customer experiences, scale operations, reduce operational expenditures tied to manual firefighting, and free up engineering talent to focus on innovation.

Looking ahead

Tomorrow’s communication networks are already being developed. 5G Advanced, cloud-native architectures, and other emerging technologies are arriving and placing increased pressure on organizations to build resiliency into the core of their networks. Disruption of any digital service will ripple across industries and will not be received kindly.

The future is already here; autonomous operations are available and quickly finding their place within complex ecosystems. CMI providers have the opportunity to embed intelligence, automation, and governance into their operational core, building a telecommunications infrastructure that is resilient by design and capable of withstanding the shocks and stresses of tomorrow’s digital ecosystem.

The message to providers is clear. Subscriber trust and loyalty are the currency of the connected age. Protecting this trust requires moving beyond reaction and embracing autonomy.
