I ditched Copilot in VS Code for this free extension and it’s miles ahead


Although I’m starting to whittle down the VS Code extensions in my coding arsenal, I still find some of them essential for my programming tasks. For example, I use C++, Python, Terraform, Ansibleand other coding/IaC languages ​​that I use to train my DevOps skills. Similarly, I have Container Tools for my self-hosting experiences, and Prettier makes my terribly formatted code a bit more readable.

However, there is one extension that I consider more important than anything else in my setup – llama-vscode. If you haven’t heard of it, llama-vscode is designed to integrate large language models with VS Code, and I can say it’s better than GitHub Copilot for my coding needs, especially after integrating it with large LLMs running on my local workstations.


Running queries in llama-vscode

I rebuilt my VS Code installation from scratch this year and it’s the fastest it’s ever been

My VS Code was drowning in extensions

I don’t like the Copilot feature built into VS Code

Its subscription fees and privacy issues make it terrible for my workloads

Accessing Copilot integration in VS Code

Let’s be clear: I’m not trying to say that the Copilot integration baked into VS Code is weak. If anything, it’s far superior to my native models for crunching hundreds of billions of settings. However, sheer numbers aren’t everything, and some 26B-35B models are powerful enough to be decent replacements for their cloud counterparts (and I’ll get to that in a bit).

What keeps me from using Copilot is its subscription-heavy cloud-based nature. The free version puts limits on the number of chat requests and autocompletes, and I’m forced to hit those limits in several coding sessions. Sure, it might be cheaper than other AI-powered VS Code competitors, but I’d rather not spend extra money on a subscription every month.

While I’ve given up on my cheapskate nature, there is an issue (or rather, lack) of privacy when relying on an external server for my coding tasks. I often use LLMs to debug complex projects or to understand what a certain function does – and that involves loading lots of snippets (and sometimes entire configuration files) into clanker. Between the confidential nature of many project files and the fact that I often throw out sensitive information like user credentials and network details when asking AI for help, you can see why I’d be reluctant to use cloud-based models in my workflow.


Asus Rog Strix RTX 3080 Ti graphics card.

Your old GPU can still run great LLMs – you just need the right tweaks

You can do a lot with these models

The Llama-vscode extension has all the AI ​​features I could ask for

It’s enough to replace Copilot in my VS Code setup

Despite its self-hosted nature, llama-vscode is quite capable of standing up to the Copilot functionality baked into VS Code. The auto-suggestion facility works really well, especially when combined with a decent LLM. I also love that there are different shortcuts to accept the first word, line, or all of the suggested fragments.

The chat feature is equally useful for asking my LLMs about random features, and I can even add all the files as context when calling clankers for help with troubleshooting/debugging the project. Better yet, VS even supports agent coding and I can fine-tune the tools and MCP servers I want my LLMs to use during a coding session. Although its UI is a bit complicated to use VS Code’s Copilot, I got used to llama-vscode within just a few hours of using it for the first time.

The extension can even spin the llama.cpp environment

But I combined it with volume models running on local llama-server instances

As for models, llama-vscode includes built-in templates for common LLMs, ranging from simple Qwen 2.5 coder models that can run on CPUs to full-fledged GPT OSS (20B). There are even provisions for accessing OpenRouter based models, but I stay away from them for obvious reasons. I’m currently using two custom llama.cpp servers that I configured before switching to llama-vscode, as it’s much easier to fine-tune model settings on a separate LLM-hosting server.

My main PC is running an RTX 3080 Ti Qwen3.6-35B-A3Band I use it for most of my VS Code tasks. But for the rest of my self-managed app stack, I’m a Gemma-4-26B-A4B For example on my GTX 1080. Since both are Expert Mix models, I can simply load the experts and the less-used parts of LLM into the system’s RAM, while leaving the focus layers on the GPU, thus running the models on hardware that doesn’t need VRAM and still get reasonable marker generation speeds. Connecting them to llama-vscode was as easy as going to the Settings menu and entering my system’s IP addresses under the Endpoint URL fields.

Qwen3.6-35B-A3B is extremely useful especially for my coding projects. I’ve relied on it for everything from debugging weird functions to troubleshooting terminal outputs from failed Proxmox experiments, and it hasn’t let me down once. The best part? Since the resulting task only takes a few seconds, my LLM hosting servers have no impact on my energy bills.



Source link

Leave a Reply

Your email address will not be published. Required fields are marked *