Taking advantage of Microsoft Edge’s built-in AI
Thursday, June 19, 2025, 11:00 AM, from InfoWorld
Large language models are a useful tool, but they’re overkill for much of what we do with services like ChatGPT. Summarizing text, rewriting our words, even responding to basic chatbot prompts are tasks that don’t need the power of an LLM and the associated compute, power, and cooling of a modern inferencing data center.
There is an alternative: small language models. SLMs like Microsoft's Phi can produce reliable results with far fewer resources because they're much smaller models. One of the latest Phi models, Phi-4-mini-instruct, has 3.8 billion parameters trained on five trillion tokens. SLMs like Phi-4-mini-instruct are designed to run on edge hardware, taking language generation to PCs and small servers. Microsoft has been investing in the development and deployment of SLMs, building its PC-based inferencing architecture on them, using ONNX runtimes with GPUs and NPUs.

The downside is that downloading and installing a new model can take time, and the one you want your code to use may not be installed on a user's PC. This can be quite a hurdle to overcome, even with Windows bundling Phi Silica with its Copilot+ PCs. What's needed is a way to deliver AI functions in a trusted form that offers the same APIs and features wherever you want to run them. The logical place for this is the browser, as we do much of our day-to-day work in one: filling in forms, editing text, and working with content from inside and outside our businesses.

An AI model in the browser

A new feature being trialed in the Dev and Canary builds of Microsoft's Edge browser provides AI APIs for working with text content, hosting Phi-4-mini in the browser. There's no need to expect users to spend time setting up WebNN, WebGPU, or WebAssembly, or to require them to preload models and have the right security permissions in place before you can call the model and run a local inferencing instance.

There are other advantages. Running models locally saves money: there's no need for an expensive cloud inferencing subscription to GPT or a similar service. Keeping inferencing local also keeps user data private; it's not transferred over the network and it's not used to train models (a process that can lead to accidental leaks of personally identifiable information).

The browser itself hosts the model, downloading and updating it as needed. Your code simply needs to initialize the model (the browser automatically downloads it if necessary) and then call JavaScript APIs to manage basic AI functions. Currently the preview APIs offer four text-based services: summarizing text, writing text, rewriting text, and basic prompt evaluation. There are plans to add support for translation services in a future release.

Getting started with Phi in Edge

Getting started is easy enough. You need to set Edge feature flags in either the Canary or Dev builds for each of the four services, restarting the browser once they're enabled. You can then open the sample playground web application to download the Phi model and start experimenting with the APIs. The download can take some time, so be prepared for a wait.

Be aware that there are a few bugs at this stage of development. The sample web application stopped updating the download progress counter roughly halfway through the process, but switching to a different API view showed that the installation was complete and I could try out the samples.

Once downloaded, the model is available to all AI API applications and is downloaded again only when an updated version is released. It runs locally, so there's no dependency on the network; it can be used with little or no connectivity.
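In code, that initialization step looks something like the following sketch. Treat it as illustrative: the LanguageModel object, its availability() states, and the downloadprogress event are drawn from the in-progress experimental API surface and may change between builds.

async function ensureModelReady() {
  // The LanguageModel object only exists when Edge's feature flags are enabled.
  if (!("LanguageModel" in self)) {
    throw new Error("Prompt API not available in this browser.");
  }

  // availability() reports states such as "unavailable", "downloadable",
  // "downloading", or "available".
  const status = await LanguageModel.availability();
  if (status === "unavailable") {
    throw new Error("The on-device model cannot run on this machine.");
  }

  // Creating a session triggers the download if the model isn't local yet;
  // the monitor callback surfaces progress for the multi-gigabyte fetch.
  return LanguageModel.create({
    monitor(m) {
      m.addEventListener("downloadprogress", (e) => {
        // e.loaded is a 0-to-1 fraction in current drafts.
        console.log(`Model download: ${Math.round(e.loaded * 100)}%`);
      });
    },
  });
}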
The test pages are basic HTML forms. The Prompt API sample has two fields for setting up user and system prompts, as well as a JSON-format constraint schema. For example, the initial sample produces a sentiment analysis for a review web application; its constraints ensure that the output is a JSON document containing only the sentiment and a confidence level.

With the model running in the browser, without the same level of protection as the larger-scale LLMs running in Azure AI Foundry, a well-written system prompt and an associated constraint schema are essential to building a trustworthy in-browser AI application. You should avoid open-ended prompts, which can lead to errors. By focusing on specific queries (for example, determining sentiment), it's possible to keep risk to a minimum, ensuring the model operates in a constrained semantic space.

Using constraints to restrict the format of the SLM output makes it easier to use in-browser AI as part of an application, for example, using numeric values or simple text responses as the basis for a graphical UI. Our hypothetical sentiment application could display a red icon beside negative-sentiment content, allowing a worker to analyze it further.

Using Edge's experimental AI APIs

Edge's AI APIs are experimental, so expect them to change, especially if they become part of the Chromium browser platform. For now, however, you can quickly add support to your pages using JavaScript and the Edge-specific LanguageModel object.

Any code needs to first check for API support and then check that the Phi model is available. The same call reports whether the model is present, absent, or currently being downloaded. Once a download has completed, you can load the model into memory and start inference. Creating a new session is an asynchronous process that lets you monitor download progress, ensuring the model is in place and that users are aware of how long a multi-gigabyte download will take.

Once the model is downloaded, start by defining a session and giving it a system prompt. This sets the baseline for any interactions and establishes the overall context for an inference. At the same time, you can use a technique called N-shot prompting to give structure to outputs, providing a set of defined prompts and their expected responses. Other tuning options set limits on how text is generated and how random the outputs are. Sessions can be cloned if you need to reuse the prompts without reloading a page, and you should destroy any sessions when the host page is closed.

With the model configured, you can now deliver a user prompt. The response can be streamed, letting you watch output tokens as they're generated, or simply delivered via an asynchronous call. The latter is the more likely choice, especially if you will be processing the output for display. If you are using response constraints, they are delivered alongside the prompt and can be JSON schemas or regular expressions. The first sketch below walks through this flow.

If you intend to use the Writing Assistant APIs, the process is similar. Again, you need to check that the API feature flags have been enabled. Opening a new session either uses the copy of Phi that's already been downloaded or starts the download process. Each API has a different set of options, such as the type and length of a summary or the tone of a piece of writing, and you can choose the output format, either plain text or Markdown. The second sketch below shows the Summarizer working this way.
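Here is a minimal sketch of that Prompt API flow, assuming the current experimental LanguageModel surface; option names such as initialPrompts and responseConstraint follow the in-progress specification and may change between builds.

// Create a session with a narrow system prompt and an N-shot example.
const session = await LanguageModel.create({
  initialPrompts: [
    // The system prompt establishes the context for every inference.
    { role: "system", content: "You classify product reviews. Reply only with JSON." },
    // N-shot prompting: a worked example shapes the expected output.
    { role: "user", content: "Great battery life, fast shipping." },
    { role: "assistant", content: '{"sentiment":"positive","confidence":0.95}' },
  ],
  temperature: 0.2, // low temperature keeps outputs predictable
  topK: 3,          // limits sampling randomness
});

// A JSON schema constraint restricts the shape of the model's output.
const schema = {
  type: "object",
  properties: {
    sentiment: { type: "string", enum: ["positive", "negative", "neutral"] },
    confidence: { type: "number", minimum: 0, maximum: 1 },
  },
  required: ["sentiment", "confidence"],
};

// Deliver the user prompt with the constraint alongside it.
const reply = await session.prompt(
  "The update broke search and support never replied.",
  { responseConstraint: schema },
);
console.log(JSON.parse(reply)); // e.g. { sentiment: "negative", confidence: 0.9 }

// Sessions can be cloned to reuse their prompts, and should be destroyed
// when the page is done with them.
const clone = await session.clone();
clone.destroy();
session.destroy();

The tight schema and low temperature keep the model inside the constrained semantic space described above: the constraint steers generation toward the schema rather than leaving validation to your UI code. The Writing Assistant APIs follow the same create-and-use pattern; here is a similar hedged sketch for the Summarizer, again assuming the draft API's option values.

// The Summarizer API follows the same pattern; Writer and Rewriter differ
// only in their option sets.
if ("Summarizer" in self) {
  const summarizer = await Summarizer.create({
    type: "key-points", // the style of summary; other types exist in the draft
    format: "markdown", // plain-text output is also available
    length: "short",
  });
  // Grab some page text to summarize; in a real app this would be user content.
  const articleText = document.querySelector("article")?.innerText ?? "";
  console.log(await summarizer.summarize(articleText));
  summarizer.destroy();
}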
CPU, GPU, or NPU?

Testing the sample Prompt API playground on a Copilot+ PC shows that, for now at least, Edge is not using Windows' NPU support. Instead, the Windows Task Manager performance indicators show that Edge's Phi model runs on the device's GPU. At this early stage of development, a GPU-only approach makes sense, as more PCs support it, especially the PCs used by the target developer audience.

It's likely that Microsoft will move to supporting both GPU and NPU inference as more PCs add inferencing accelerators and once the Windows ML APIs are finished. Windows ML's common ONNX APIs for CPU, GPU, and NPU are a logical target for Edge, especially if Microsoft prepares its models for all the target environments, including Arm, Intel, and AMD NPUs. Windows ML gives Edge's developers tools to first test for appropriate inferencing hardware and then download optimized models. As this process can be automated, it seems ideal for web-based AI applications, where developers have no visibility into the underlying hardware.

Microsoft's Windows-based AI announcements at Build 2025 provide enough of the necessary scaffolding that bundling AI tools in the browser makes a lot of sense. You need a trusted, secure platform to host edge inferencing, one where you know the hardware can support a model and where one standard set of APIs ensures you only have to write code once to have it run anywhere your target browser runs.
https://www.infoworld.com/article/4009190/taking-advantage-of-microsoft-edges-built-in-ai.html