My self-hosted LLMs are more than just a chat replacement – here's how they boost my productivity

Ask Hardcore ChatGPT, Claude, and Perplexity for their thoughts on native LLMs and you’ll hear a lot of arguments about their performance-sapping nature, complex structure, and lack of computational prowess. However, the most common complaint is that self-hosted models are not capable of much more than serving as chatbots. To be honest, I thought the same before entering the local LLM ecosystem, as my limited interaction with poor models left me dissatisfied with the results.

But after testing all kinds of models on my workstation nodes and combining them with FOSS utilities in my home lab, I learned that local models can become productivity powerhouses if configured correctly.

Claude can call when local LLM gets stuck and he changed everything about my local first install

Domestic LLMs are not very good on their own

No need to limit your LLM workloads to low-parameter models

The key word here is TN discharge

Most of the misconceptions about lack of accuracy and reasoning capabilities come from people who attempt to self-deploy low-parameter LLMs, and they’re absolutely right. Aside from a handful of finely tuned models, most clankers in the 0.8B-4B range tend to spout utter gibberish when asked to do anything even remotely technical. The 20B and larger models are significantly better at grounding, but they require a lot of computational skill. But here’s the thing: you don’t have to resort to poor models, and with the right set of tweaks, it’s possible to host large LLMs even on older machines. The source? Regards, currently using a model 26B on a ten year old machine!

It can still run well on any old hardware, despite the huge feat of thinking, thanks to the Expert Blend models. In traditional LLMs, the only way to load them on VRAM-constrained GPUs –ngl tick all layers to push system memory resources, causing extremely slow speeds once they start demanding them. On the other hand, TN models allow you to load heavy specialists into the main CPU and RAM. –n-cpu-moebut you can still run focus layers on the graphics card. As long as the combined RAM and VRAM can accommodate the tens of billions of parameters required by the TN model, you’re good to go.

Actually, I host myself Gemma4-26B-A4B on GTX 1080 only (8 GB VRAM + 24 GB system memory) and Qwen3.6-35B-A3B on RTX 3080 Ti (12 GB VRAM + 32 GB system RAM). Even with the workaround, my token generation rates don’t drop below 14t/s, which is pretty impressive considering I haven’t upgraded these systems in a long time. As for the computing capabilities of these LLMs, they are more than enough for my day-to-day work.

Your old GPU can still run great LLMs – you just need the right tweaks

You can do a lot with these models

Local LLMs pair incredibly well with coding programs

Even Claude Code supports native models

Starting with the area where you’re likely to find AI tools, high spec LLMs are surprisingly capable of meeting my coding requirements. In fact, I’d say Qwen3.6-35B-A3B can go toe-to-toe with subscription-based cloud models when it comes to code generation, auto-suggesting all fragments, scanning blocks for vulnerabilities, and eliminating weird terminal results. I have configured call-vscode Pairing my llama-server instance with VS Code and the integration fits well with my programming workloads.

With the very powerful Claude Code supporting native LLM backends, I locked myself into Qwen3.6-35B-A3B. Let me tell you, there’s nothing better than promoting this hegemon of LLM for everything from refactoring broken code to rewriting functions with proper indentation, without incurring rate restrictions or high API usage penalties in Claude Code.

I’ve built a FOSS productivity stack centered around my native models

Aside from my coding assignments, LLMs are equally valuable to my FOSS arsenal. For example, I connected a llama-server running Gemma4-26B-A4B to the logging Blinko, and LLM is quite efficient at querying, summarizing and tagging my logging collection. Likewise, my Gemma4 and MiniCPM-V models are perfect for performing OCR analysis on my documents via Paperless-GPT. This setup also works well for Paperless AI, which is responsible for assigning the correct correspondents, tags, dates, and other fields to my documents, and also helps queries with RAG support on large documents.

Then there’s Karakeep and its automatic summarizing tools, which work on PDFs, YouTube videos, and any hyperlinks I add to this bookmark manager. I haven’t even mentioned Open WebUI, the Swiss army knife of AI tools. On the surface, it’s a chatbot front-end, but once you get into its settings, you’ll realize that it can be anything from a powerful AI search engine with SearXNG to a rendering agent with ComfyUI models. In addition, it even supports MCP servers, so I rely on it to run other applications that don’t natively support LLMs. While we’re on the subject…

MCP servers can combine their advanced reasoning capabilities with unsupported applications

If you haven’t heard of them, Model Context Protocol servers are bridges that expose specific tools and files from other applications to LLM clients, and they’re especially useful for platforms that don’t have any AI-centric features. For example, I have a TrueNAS instance connected to an MCP server running on my computer, and after disabling any tools that can write data to the NAS, I can send simple voice commands to my LLMs and have them respond to my requests immediately. The same is true for Nextcloud, and my preferred MCP server can even call Calendar, News, and Office apps to manage every aspect of my private cloud.

Native LLMs are not just chatbot replacements

Having spent hours configuring the correct LLM pipelines for my FOSS models, I have to admit that native AI tools are more useful than they first appear. Larger models like my favorite Qwen3.6-35B-A3B handle pretty much every task I throw at them while running at decent speeds on older hardware. Throw in a handful of containerized services, add MCP servers for those that don’t support my llama-server locally, and my local LLM pipeline becomes good enough to replace cloud platforms for everyday productivity tasks.

Source link

My self-hosted LLMs are more than just a chat replacement – here’s how they boost my productivity

Claude can call when local LLM gets stuck and he changed everything about my local first install

No need to limit your LLM workloads to low-parameter models

The key word here is TN discharge

Your old GPU can still run great LLMs – you just need the right tweaks

Local LLMs pair incredibly well with coding programs

Even Claude Code supports native models

I’ve built a FOSS productivity stack centered around my native models

MCP servers can combine their advanced reasoning capabilities with unsupported applications

Native LLMs are not just chatbot replacements

Leave a ReplyCancel Reply

Sam Altman wants to go on record to show that AI won’t take your job (despite everything he’s said before)

A Google engineer has been charged with insider trading after making $1.2 million in Polymarket

French startup funding to drop 5% in 2025 as AI concentration grows

Claude can call when local LLM gets stuck and he changed everything about my local first install

No need to limit your LLM workloads to low-parameter models

The key word here is TN discharge

Your old GPU can still run great LLMs – you just need the right tweaks

Local LLMs pair incredibly well with coding programs

Even Claude Code supports native models

I’ve built a FOSS productivity stack centered around my native models

MCP servers can combine their advanced reasoning capabilities with unsupported applications

Native LLMs are not just chatbot replacements

Leave a ReplyCancel Reply

Trending now

Sam Altman wants to go on record to show that AI won’t take your job (despite everything he’s said before)

A Google engineer has been charged with insider trading after making $1.2 million in Polymarket

French startup funding to drop 5% in 2025 as AI concentration grows