Lower inference costs could dramatically expand enterprise AI adoption: d-Matrix CEO

As artificial intelligence moves from experimentation to large-scale deployment, the economics of AI infrastructure are beginning to shift. While much of the industry’s early focus has been on training increasingly large models, the next phase of growth is likely to be driven by inference — the process of running AI models in production at scale.
In a conversation with TechCircle, Sid Sheth, founder, president and CEO of d-Matrix, argues that efficient inference, not just model training, will define the competitive dynamics of the AI industry over the coming decade. A semiconductor industry veteran who previously served as senior vice-president at Inphi Corporation, Sheth explains why the company is building a new class of inference-focused chips, where GPU-centric AI infrastructure falls short, and how India's role in the global semiconductor ecosystem is growing. Edited excerpts:
AI is moving from model training to large-scale deployment. Why do you believe inference — rather than training — will define the economics going forward?
We founded d-Matrix in 2019 with a singular focus on inference. Our belief was simple: not everyone will train models, but everyone will want to use them. Inference is what actually brings AI into the real world and enables applications that millions of users interact with every day.
When we first spoke to investors, many were still asking basic questions like “What is inference?” But we believed that once models became capable enough, running them efficiently at scale would become the real bottleneck.

That’s exactly what we’re seeing now. New reasoning models and open ecosystems have accelerated the application layer, with developers building interactive applications across coding, video and agentic workflows.
Training builds the model, but inference runs it billions — eventually trillions — of times each day. Training is periodic, while inference is continuous. As AI becomes embedded into products and workflows, inference demand multiplies rapidly. The companies that can deploy intelligence efficiently at scale will define the economics of the next decade.
d-Matrix is betting on Digital In-Memory Computing instead of traditional GPU-based architectures. What structural limitations in GPU design make them less optimal for sustained inference workloads?
GPUs were designed primarily for large-scale parallel computation, which makes them extremely effective for training models. But inference at scale presents a different challenge. Modern inference involves constant interaction between compute and memory. Moving data between the two consumes time, bandwidth and power, and as inference workloads grow, memory bandwidth — not compute — becomes the main bottleneck. Technologies like high-bandwidth memory have helped extend performance, but they don’t fundamentally eliminate the separation between compute and memory.

Our approach is to fuse compute and memory directly. Matrix multiplications occur inside the memory array itself, reducing the need for constant data movement. This significantly improves efficiency and lowers latency. We’re also working on next-generation designs that stack memory vertically using 3D DRAM, which increases bandwidth and capacity without relying on scarce external memory technologies. In many ways, inference is fundamentally a memory problem, and solving it requires architectural redesign rather than incremental scaling.
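
To put the memory-bandwidth argument in concrete terms, the rough roofline sketch below estimates single-stream decode throughput for a dense transformer. The model size, bandwidth and compute figures are illustrative assumptions, not d-Matrix or GPU vendor specifications.

```python
# Illustrative back-of-envelope roofline estimate for autoregressive decoding.
# All figures are assumptions chosen for illustration only.

def decode_tokens_per_second(params_billions: float,
                             bytes_per_param: float,
                             memory_bandwidth_gb_s: float,
                             peak_tflops: float) -> dict:
    """Estimate single-stream decode throughput for a dense transformer.

    Generating one token requires reading every weight once (the KV cache is
    ignored here for simplicity), so at low batch sizes throughput is capped
    by memory bandwidth rather than by raw compute.
    """
    weight_bytes = params_billions * 1e9 * bytes_per_param
    # Memory-bound ceiling: how many times per second can the weights be streamed?
    memory_bound_tps = memory_bandwidth_gb_s * 1e9 / weight_bytes
    # Compute-bound ceiling: roughly 2 FLOPs per parameter per generated token.
    compute_bound_tps = peak_tflops * 1e12 / (2 * params_billions * 1e9)
    return {
        "memory_bound_tokens_per_s": memory_bound_tps,
        "compute_bound_tokens_per_s": compute_bound_tps,
        "bottleneck": "memory" if memory_bound_tps < compute_bound_tps else "compute",
    }

# Hypothetical accelerator: 70B-parameter model with 8-bit weights,
# 3 TB/s of memory bandwidth, 1,000 TFLOPS of peak compute.
print(decode_tokens_per_second(70, 1.0, 3000, 1000))
# Memory ceiling is roughly 43 tokens/s versus a compute ceiling of roughly
# 7,000 tokens/s: the chip spends most of its time waiting on memory, which is
# the gap that compute-in-memory designs aim to close.
```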
The company has raised over $450 million, including backing from Microsoft. What does hyperscaler interest signal about the urgency to rethink AI infrastructure?
Microsoft began engaging with us in 2021 with a strong understanding of the inference challenge, particularly for transformer-based workloads. Their continued support across funding rounds signals something important: hyperscalers recognise that scaling AI is not only a training challenge. Inference will ultimately be the larger infrastructure problem.
Most solutions today still rely on variations of the GPU model — pairing powerful compute with external memory. While that works, it also introduces structural issues around cost, latency and supply constraints.

Our architecture is designed specifically for transformer inference workloads. Importantly, we don’t see ourselves as replacing GPUs. Instead, the future will be heterogeneous, where GPUs handle certain workloads and specialised inference accelerators handle others.
Our PCIe-based accelerator cards can be deployed within existing infrastructure, allowing customers to scale AI without rebuilding their data centres.
As enterprises shift toward smaller, task-specific models, how materially can lowering inference costs change enterprise AI adoption?
Lower inference costs could dramatically expand enterprise AI adoption. A key turning point came when developers began distilling large frontier models into smaller, efficient ones that retained strong capabilities. Many enterprise applications now rely on these smaller models, where GPUs can sometimes be excessive.
We’re seeing strong traction around models like DeepSeek and Qwen among developers and large enterprises. As companies adopt multiple smaller models across applications, the infrastructure challenge shifts from running one massive model to efficiently serving many different workloads.
Today, applications like coding copilots, autonomous agents and real-time customer support systems remain constrained by inference latency and cost. When the cost per token drops significantly, those applications become economically viable at scale.
Efficient inference will democratise AI adoption, allowing organisations of all sizes to embed AI into everyday workflows.
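
As an illustration of that cost-per-token point, the minimal sketch below works through the monthly bill for one automated workflow before and after a tenfold drop in per-token pricing. The workload size and prices are entirely hypothetical.

```python
# Illustrative cost-per-token arithmetic for an agentic workload.
# All prices and token counts are hypothetical, not vendor pricing.

def monthly_inference_cost(tasks_per_day: int,
                           model_calls_per_task: int,
                           tokens_per_call: int,
                           usd_per_million_tokens: float) -> float:
    """Rough monthly inference bill for one automated workflow."""
    tokens_per_day = tasks_per_day * model_calls_per_task * tokens_per_call
    return tokens_per_day * 30 * usd_per_million_tokens / 1e6

# A hypothetical support-automation workflow: 10,000 tasks per day,
# 20 model calls per task, 2,000 tokens per call.
for price in (5.0, 0.5):  # USD per million tokens, before and after a 10x drop
    cost = monthly_inference_cost(10_000, 20, 2_000, price)
    print(f"${price}/M tokens -> ${cost:,.0f} per month")
# $5.0/M tokens -> $60,000 per month
# $0.5/M tokens -> $6,000 per month
```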
India has traditionally been strong in software but less visible in semiconductor IP. Do companies like d-Matrix represent a shift in India’s role in the AI value chain?
India’s position in the AI ecosystem is clearly rising. At d-Matrix, our Bengaluru team contributes directly to architecture, system design and advanced silicon development. Their work is not limited to implementation; they are involved in shaping the core technology roadmap.
Semiconductor innovation is becoming more globally distributed, and India is increasingly participating in foundational design decisions that influence next-generation infrastructure. With continued investment in education, startups and policy support, India can play a much larger role in the global AI value chain.
Is the AI infrastructure market today too dependent on Nvidia, and do you see room for credible alternatives?
Nvidia has built an extraordinary ecosystem around AI training, and its platform depth is a significant advantage. However, as inference workloads scale, the industry is moving toward heterogeneous AI infrastructure. GPUs will remain central, but additional coprocessors will emerge to handle specialised workloads more efficiently. The future is not about replacing GPUs but diversifying the hardware stack so that each workload runs on the most appropriate architecture.
With AI data centres facing rising energy scrutiny globally, how important will energy efficiency be in chip design?
Energy efficiency will become a critical differentiator in AI infrastructure. Much of today’s AI cost comes from moving data between compute and memory, which consumes power and introduces latency. In some cases, advanced chips are already being deployed in facilities that struggle to supply sufficient power.

At d-Matrix, we designed our architecture to minimise data movement by integrating compute with memory. Our accelerators are also air-cooled, making them easier to deploy within existing data centres. Because of these design choices, we are able to deliver significantly better performance per watt compared with traditional GPU-based systems. Improving efficiency will make large-scale AI deployment both economically and environmentally more sustainable.
