Beyond Chat: How to Set Up Local LLM Code Completion in VS Code with Ollama – SitePoint

Open-source models have caught up. Pair one with Ollama for local model serving and the Continue extension for VS Code integration, and you can build a fully offline code completion pipeline in under 30 minutes. This tutorial walks you through every step.
GitHub Copilot has become table stakes for millions of developers. Microsoft reported more than 77,000 organizational customers during its Q3 FY2025 earnings call. It dominates the AI coding assistant market. But here’s the tension: the same organizations adopting AI coding tools are also tightening the rules around them. Setting up local LLM code completion in VS Code gives you the productivity boost without the compliance headache.
Samsung banned employee use of ChatGPT and similar generative AI tools after engineers uploaded confidential source code and internal meeting notes to external servers. Finance, defense, and healthcare companies followed with their own restrictions. The worry is straightforward: intellectual property exposure and regulatory obligations. When your code leaves your machine and hits a third-party API, you lose control over how it gets stored, processed, or fed into model training.
The good news: open-source models have caught up. Qwen2.5-Coder-7B scores competitively on HumanEval, in the same ballpark as GPT-3.5-class models. Pair it with Ollama for local model serving and the Continue extension for VS Code integration, and you can build a fully offline code completion pipeline in under 30 minutes. This tutorial walks you through every step.
When you use a cloud-based code assistant, your editor ships code context to a remote server for inference. That context can include proprietary business logic, API keys accidentally left in source files, internal architecture patterns, and unreleased feature implementations. The Samsung incident makes this concrete: employees pasted sensitive semiconductor source code and internal meeting transcripts into ChatGPT, effectively handing trade secrets to an external service. Once data hits a third-party server, your organization’s control over it depends entirely on the vendor’s retention policies and security posture.
If your organization operates under HIPAA, SOC 2, ITAR, or FedRAMP requirements, sending code context to external APIs introduces audit complexity at best and outright violations at worst. Air-gapped environments, common in defense and critical infrastructure, make cloud-based tools physically impossible to use.
Local inference sidesteps all of this. The model runs on your hardware. The data stays in your process. There’s no network request to audit or justify.
The architecture is simple. VS Code runs the Continue extension, which acts as the client layer for both tab completions and chat interactions. Continue sends requests over HTTP to Ollama, which runs as a local server on your machine. Ollama loads the model into memory, runs inference, and streams results back to Continue, which renders them as ghost text in your editor or as chat responses in the sidebar.
The flow:
VS Code Editor → Continue Extension → HTTP (localhost:11434) → Ollama → Loaded Model → Response
For tab completions, Continue uses fill-in-the-middle (FIM) prompting. It sends the code before and after your cursor as context and asks the model to predict what goes in between. For chat, it uses standard instruct-style prompting. Ollama handles model management, quantization, and GPU offloading, so you never touch raw model weights directly.
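To make the FIM idea concrete, here is a minimal Python sketch of how a prompt can be assembled from the text on either side of the cursor. The three special tokens are the ones Qwen2.5-Coder documents for fill-in-the-middle. The helper names (split_at_cursor, build_fim_prompt) are hypothetical, and the exact template Continue sends may differ from this one.

```python
# Fill-in-the-middle (FIM) special tokens documented for Qwen2.5-Coder.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def split_at_cursor(source: str, cursor: int) -> tuple[str, str]:
    """Split a buffer into the text before and after the cursor offset."""
    return source[:cursor], source[cursor:]


def build_fim_prompt(before_cursor: str, after_cursor: str) -> str:
    """Lay out prefix and suffix so the model generates the missing middle.

    Hypothetical helper: it illustrates the prompt layout only and is not
    Continue's actual implementation.
    """
    return f"{FIM_PREFIX}{before_cursor}{FIM_SUFFIX}{after_cursor}{FIM_MIDDLE}"
```

The model’s output is whatever it predicts should follow the final token, which the editor then shows as ghost text at the cursor position.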
You don’t need a monster rig for this, but more muscle helps.
That’s it. No Docker, no Python environment, no CUDA toolkit wrangling.
Install Ollama with a single command:
After installation, start the Ollama service:
On macOS and Windows, the desktop app starts the service automatically.
Qwen2.5-Coder at 7B parameters is a strong choice for local code completion. It supports a 128K context window natively (though you’ll typically constrain this in practice for performance) and runs comfortably on 16 GB of RAM at Q4 quantization. Pull it:
Run a quick smoke test:
You should see a complete, syntactically valid function streamed to your terminal. If the output is garbled, the download was probably incomplete or corrupted, so run the pull again.
Ollama exposes a REST API on http://localhost:11434 by default. Verify it:
You should get back a JSON response with a response field containing the model’s completion. If this works, your local inference backend is ready.
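If you’d rather run the same check from a script, the sketch below does it with Python’s standard library. It assumes Ollama’s /api/generate endpoint on the default port and a non-streaming request. The function names are my own, and the timeout value is an arbitrary choice.

```python
import json
import urllib.request

# Default local endpoint; adjust if you changed Ollama's host or port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's generate endpoint."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_completion(raw_body: bytes) -> str:
    """Pull the generated text out of the JSON reply's "response" field."""
    return json.loads(raw_body)["response"]


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local server and return the completion text."""
    request = build_generate_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return extract_completion(reply.read())
```

With the server running, generate("qwen2.5-coder:7b", "Write a Python function that reverses a string.") should return the completion as a plain string. A connection error here usually means the Ollama service isn’t up.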
Several extensions connect VS Code to local models: Twinny, llama-coder, CodeGPT, among others. I keep coming back to Continue for a few reasons. It supports both tab autocomplete and sidebar chat in a single extension. It lets you assign different models to different roles, so you can run a small, fast model for completions and a bigger one for chat. The maintainers ship frequent releases. And it’s open source.
Open the Extensions panel, search for “Continue,” click Install. Done.
For air-gapped environments, download the .vsix from the Continue GitHub releases page and install manually via Extensions: Install from VSIX in the command palette.
Continue stores its configuration in config.json, typically at ~/.continue/config.json. The exact location depends on your Continue version, and the project has been migrating toward a config.yaml format. This file defines which models to use, how to reach them, and how completions behave. Here’s a complete, annotated configuration you can drop in and start using:
Let me walk through what matters here.
tabAutocompleteModel is the model that powers inline ghost-text completions. I set contextLength to 4096 here because shorter context keeps inference fast. You’re not writing a novel; you’re completing a function.
models defines what’s available in the chat sidebar. Higher contextLength makes sense here because chat interactions are less latency-sensitive than tab completions.
debounceDelay controls how many milliseconds Continue waits after you stop typing before firing a completion request. 500ms works well on most hardware. Drop it to 300ms if you’re on an M3 Max and feeling impatient.
maxPromptTokens controls how much surrounding code Continue sends as context. More context means better suggestions but slower inference. 2048 is a solid default.
multilineCompletions set to "always" gives you full function body completions, not just single-line fills. This is where local models really start to feel useful.
allowAnonymousTelemetry set to false keeps everything local. The whole point of this setup, right?
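The settings described above can be assembled into one file. The Python sketch below builds them as a dictionary and renders the JSON, assuming the legacy config.json schema with a tabAutocompleteOptions block. The titles and the chat contextLength of 8192 are my assumptions, while the other values come straight from the walkthrough.

```python
import json


def build_continue_config(model_tag: str = "qwen2.5-coder:7b") -> dict:
    """Assemble a Continue config using the values discussed in the text."""
    return {
        # Models offered in the chat sidebar.
        "models": [
            {
                "title": "Qwen2.5-Coder 7B (chat)",  # assumption: any label works
                "provider": "ollama",
                "model": model_tag,
                "contextLength": 8192,  # assumption: "higher" than autocomplete
            }
        ],
        # Model that powers inline ghost-text completions.
        "tabAutocompleteModel": {
            "title": "Qwen2.5-Coder 7B (autocomplete)",  # assumption
            "provider": "ollama",
            "model": model_tag,
            "contextLength": 4096,  # shorter context keeps inference fast
        },
        "tabAutocompleteOptions": {
            "debounceDelay": 500,  # ms to wait after the last keystroke
            "maxPromptTokens": 2048,  # surrounding code sent as context
            "multilineCompletions": "always",
        },
        "allowAnonymousTelemetry": False,
    }


def render_config(config: dict) -> str:
    """Serialize the config as indented JSON, ready to save as config.json."""
    return json.dumps(config, indent=2) + "\n"
```

Writing render_config(build_continue_config()) to your Continue config path gives you a starting point. If your Continue version has already moved to config.yaml, treat this as a map of the settings and not as a drop-in file.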
Open any code file in VS Code. Type a function signature and pause:
Press Tab to accept. Escape to dismiss.
If completions feel sluggish, you’ve got several levers to pull.
Cut contextLength in tabAutocompleteModel to 2048 or even 1024. Pull a Q4 quantization variant if you haven’t already. Bump debounceDelay to 750ms to reduce inference calls while you’re still mid-thought.
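To see why a longer debounceDelay means fewer inference calls, here is a small Python simulation. It is a sketch of generic debounce behavior, not Continue’s actual code. The function name and the keystroke timings in the usage note are made up for illustration.

```python
def debounced_request_times(keystrokes_ms: list[int], delay_ms: int) -> list[int]:
    """Return the times (in ms) at which a completion request would fire.

    A request fires delay_ms after a keystroke only if no further keystroke
    arrives before the timer runs out. Hypothetical model of debouncing.
    """
    ordered = sorted(keystrokes_ms)
    fired = []
    for index, pressed_at in enumerate(ordered):
        is_last = index == len(ordered) - 1
        # The pending timer survives only if the next key comes late enough.
        if is_last or ordered[index + 1] - pressed_at >= delay_ms:
            fired.append(pressed_at + delay_ms)
    return fired
```

For keystrokes at 0, 120, 260, 900, and 1500 ms, a 500 ms delay fires three requests (at 760, 1400, and 2000 ms), while a 750 ms delay fires only one (at 2250 ms). Fewer wasted requests, at the cost of waiting a little longer for ghost text.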
You can also create a custom Modelfile for Ollama with tuned parameters:
Save that as Modelfile and create your custom model:
Then update config.json to reference qwen-fast as the model name.
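As a rough illustration of what such a Modelfile can contain, the Python sketch below renders one from a base model and a set of parameters. FROM and PARAMETER are real Modelfile instructions, and temperature, num_ctx, and num_predict are real Ollama parameters. The specific values here are my assumptions, chosen to match the tuning advice in this section, and render_modelfile is a hypothetical helper.

```python
def render_modelfile(base_model: str, parameters: dict[str, object]) -> str:
    """Render a minimal Ollama Modelfile: one FROM line plus PARAMETER lines."""
    lines = [f"FROM {base_model}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"


# Assumed values: low temperature for stable completions, a short context
# window for speed, and a cap on generated tokens.
MODELFILE_TEXT = render_modelfile(
    "qwen2.5-coder:7b",
    {"temperature": 0.2, "num_ctx": 2048, "num_predict": 128},
)
```

Save MODELFILE_TEXT to a file named Modelfile, then register it with ollama create qwen-fast -f Modelfile so the qwen-fast name is available to Continue.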
Want better suggestions? Increase contextLength, bump temperature to 0.4 for more varied completions, or step up to a 14B or 32B model if your hardware can handle it. The tradeoff is always speed versus quality, and the right balance depends on your machine and your patience.
You can use the same model for both autocomplete and chat, or dedicate a larger model to chat. To add a separate chat model, update the models array in your config:
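A sketch of that update in Python, assuming the same legacy config.json layout: it appends a second entry to the models array and leaves the autocomplete model untouched. The function name and the example title are hypothetical, and you’ll need to pull whichever tag you reference before Continue can use it.

```python
import copy


def with_chat_model(config: dict, title: str, model_tag: str, context_length: int) -> dict:
    """Return a copy of the config with one more chat model in "models"."""
    updated = copy.deepcopy(config)  # leave the caller's config unchanged
    updated.setdefault("models", []).append(
        {
            "title": title,
            "provider": "ollama",
            "model": model_tag,
            "contextLength": context_length,
        }
    )
    return updated
```

For example, with_chat_model(config, "Qwen2.5-Coder 14B (chat)", "qwen2.5-coder:14b", 8192) adds a larger model to the sidebar’s model picker while tab completions keep using the small, fast one.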
The chat sidebar in Continue supports inline edit commands, code explanation, and refactoring workflows, all running against your local model.
Continue supports context providers that let you reference parts of your workspace in chat. Type @file to include a specific file’s contents, @codebase to let Continue search your project for relevant code, or @terminal to include recent terminal output. This gives the model grounded context without any data leaving your machine.
I use @codebase constantly. It’s the closest thing to Copilot’s multi-file awareness you’ll get locally.
A note on those numbers: HumanEval pass@1 scores vary significantly depending on evaluation methodology, prompting strategy, and quantization. The figures above are approximate, based on published benchmarks from model authors. Your results with quantized local inference will differ. I’ve seen Qwen2.5-Coder produce noticeably worse completions at Q4 than its reported benchmarks suggest. Still good. Just not magic.
For web development in TypeScript or Python, Qwen2.5-Coder-7B offers the best quality-to-size ratio. I’ve tried the others and keep coming back to it.
If you’re working on enterprise Java or C++ with large files, DeepSeek-Coder-V2-Lite handles longer context better, though note it uses a Mixture-of-Experts architecture with 2.4B active parameters out of 16B total.
Running on 8 GB of RAM? Drop to qwen2.5-coder:3b. The quality takes a hit, but it’s still useful for boilerplate and simple completions.
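For a back-of-the-envelope feel for why model size maps to RAM the way it does, the Python sketch below estimates the memory taken by the weights alone. It assumes a flat 4 bits per weight and nominal parameter counts, which is a simplification: real Q4 files run somewhat larger, and the context cache, the runtime, and everything else on your machine come on top.

```python
def estimate_weight_gib(parameters_billion: float, bits_per_weight: float = 4.0) -> float:
    """Estimate weight memory in GiB: parameters x bits per weight / 8 bytes.

    Rough lower bound only; it ignores context cache and runtime overhead.
    """
    total_bytes = parameters_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def describe(parameters_billion: float) -> str:
    """Format the estimate for a quick side-by-side comparison."""
    gib = estimate_weight_gib(parameters_billion)
    return f"{parameters_billion:g}B at 4-bit: about {gib:.1f} GiB of weights"
```

Under these assumptions, a 7B model needs a bit over 3 GiB for weights and a 3B model well under 2 GiB, which is why the smaller one is the safer pick when the whole machine only has 8 GB.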
Local inference has no network-induced latency variance: no round-trip, no server-side queuing, no provider outages. It works on planes, in classified facilities, behind any firewall. There’s no telemetry to audit, no data retention policy to negotiate with legal.
I ran both setups side-by-side for two weeks. For single-file Python and TypeScript work, the local setup felt faster than Copilot on several occasions, simply because there was no network jitter.
Copilot benefits from frontier-class models with massive context windows and multi-file awareness powered by cloud-scale compute. For complex cross-file refactoring, cloud models still produce better results. And there’s zero hardware overhead on your local machine.
Local 7B models cover the majority of what most developers actually use Copilot for day-to-day: single-file completions, boilerplate generation, short function implementations. Stepping up to 14B or 32B for chat narrows the gap further. For developers in restricted environments, this is more than enough.
It won’t blow your mind on a complex multi-file refactor across 15 modules. Be realistic about that.
If completions aren’t appearing, verify Ollama is running with ollama list (it should show your pulled models). Check that the model name in config.json exactly matches the Ollama model tag. Open the Continue output panel in VS Code (Output → Continue) to see error messages. Nine times out of ten, it’s a model name mismatch.
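Since a name mismatch is the usual culprit, it can be worth checking mechanically. The Python sketch below compares the model tags in a Continue config against the reply from Ollama’s /api/tags endpoint, which lists installed models under a models array with a name field. The function names are hypothetical, and the exact string comparison is a simplification: Ollama treats a bare name as the :latest tag, which this check does not account for.

```python
import json


def installed_tags(tags_body: bytes) -> set[str]:
    """Extract installed model names from an /api/tags JSON reply."""
    return {entry["name"] for entry in json.loads(tags_body).get("models", [])}


def missing_models(config: dict, available: set[str]) -> list[str]:
    """List model tags the config references that Ollama doesn't have."""
    wanted = [entry.get("model") for entry in config.get("models", [])]
    autocomplete = config.get("tabAutocompleteModel")
    if autocomplete:
        wanted.append(autocomplete.get("model"))
    # Exact match only: "qwen2.5-coder" and "qwen2.5-coder:7b" count as different.
    return sorted({tag for tag in wanted if tag and tag not in available})
```

Fetch http://localhost:11434/api/tags, pass the body to installed_tags, and feed the result to missing_models along with your parsed config. An empty list means every referenced tag is installed.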
Slow or laggy completions usually mean you’re sending too much context. Reduce contextLength and maxPromptTokens in your autocomplete config. Run ollama ps to confirm the model is loaded, then check the PROCESSOR column, which shows how the model is split between CPU and GPU. If it reports 100% CPU, nothing is being offloaded to the GPU, and that’s your bottleneck.
Getting nonsensical suggestions? Make sure the model you’re using supports FIM (fill-in-the-middle) prompting. Lower temperature to 0.1 or 0.2 for more deterministic output. Cut maxTokens to stop the model from rambling past the useful completion.
The local code completion stack has crossed from “interesting experiment” to “daily driver.” Ollama makes model management painless, Continue provides a polished editor experience, and 7B-class code models deliver genuinely useful suggestions. This isn’t about replacing Copilot for every developer. It’s about having a real alternative when your code can’t leave your machine.
Copy the config JSON from this article, pull qwen2.5-coder:7b, and run it for a week alongside your current workflow. The Continue and Ollama communities on Discord and GitHub are active and helpful when you hit snags.
Your code stays on your hardware. That’s the whole point.
Matt is the co-founder of SitePoint, 99designs and Flippa. He lives in Vancouver, Canada.