I've run Gemma, Llama and Qwen on my phone and only one of them is worth keeping

Local LLMs have been part of my desktop setup for what feels like ages now, and I've written about a fair share of them. I had never taken the phone side as seriously, even after my initial experience with Gemma 4 on mobile. I still default to Claude, and that probably won't change anytime soon.

But on-device AI on my phone has proved invaluable during outages, at least when ISP and cell tower outages happen at the same time, which they do here from time to time. So I thought it might be worth getting more serious about it and exploring what else was out there. I landed on Llama, Gemma and Phi, but took a detour with Qwen. I put them all through the same tests to determine which ones are actually worth keeping on my phone.

Gemma 4 E2B

Starting with what I already know

Gemma 4 E2B is Google's flagship model designed from the ground up to run on a phone rather than shrunk down from something larger. Its architecture allows it to work in about 2GB of memory, and it handles images natively. It was my first mobile model, and almost the only one, because it worked so well in real use that I saw no reason to look elsewhere. So I thought I'd see how my baseline stacks up against the new competition.

I first ran it through a few prompts that the local LLM community uses as a quick vibe check: the strawberry test, the marble in the cup, and Sally's sisters. I got these from the site of Mervyn Praison, an AI developer.

According to Gemma, there are two R's in strawberry, which is a popular fail for small models. It has something to do with how the word is tokenized, and even some cloud chatbots stumble on it for whatever reason. The marble answer was a strange one, though: it went logically step by step, correctly identified that the cup was upside down, then somehow landed on the marble still being in the microwave with the cup. Sally's sisters went the same way, with reasoning that pointed to the right answer without ever getting there. It explained the steps in all three beautifully, it just couldn't land the answer. Not exactly stupid, more like it snaps at the end of a reasoning thread.

Then I gave it the actual assignment I came here for: a Python 101 weekend course built for complete beginners, with code samples and exercises. The course was good. It put real effort into the structure, breaking everything into one-hour blocks with code samples and exercises. Where it fell short, in my opinion, was the technical setup: it just told me to "open a text editor or online Python interpreter" without much thought about what those mean to a complete beginner. I think Gemma stays my go-to for visual tasks, because courses aren't really its forte.

Llama 3.2 3B

Meta’s dialogue specialist, on the phone

Llama 3.2 3B is Meta's small dialogue model, tuned specifically for the conversational and summarization tasks you'd run on a phone. Meta's own benchmarks claim it outperforms Gemma 2 2.6B and Phi-3.5 mini, although Meta would say that. I chose it because the Llama family is the default name everyone reaches for when they want a small model, so I had to try it out for myself. Also because I had no idea how it would compare to my Gemma baseline.

Llama also got the strawberry wrong, but in a funnier way. It spelled the word out (STRAWBERRY) and then still said there were two R's. The marble question ended in the same logical but incorrect conclusion as Gemma. Sally's sisters was actually worse: its answer was two, and the "one for each of her siblings" reasoning does not follow. Just for fun, I also asked what it likes to eat for breakfast, and it politely declined to have a personality, then followed up with a list of breakfast options. Overall it was what I expected, as most models trip on these, but Llama was a bit worse.

The course was where Llama really stepped up. The structure felt less fancy than Gemma's, without all the emojis and bold section headings, giving me clean blocks with just code and explanations. The code itself felt a bit more practical: it included an input prompt and a while loop in the final project, which is more than Gemma's Hello World demo gave me. I'm not a Python guy, so I can't say for sure whether this is a good course in the proper sense, but it felt closer to something I'd actually work through on a Saturday morning.

Qwen 3.5 4B

My unplanned third choice

Phi-3.5 mini was supposed to be the third model. I wanted to test it because Microsoft trained it on filtered "textbook quality" synthetic data rather than regular internet scrapes, which is a really interesting design choice. But Phi crashed on every quant I tried in PocketPal, and the smallest IQ2 version crashed my whole phone for some reason. So I swapped in Qwen 3.5 4B. It's Alibaba's small, dense model with thinking and vision. It's bigger than Phi, but ran smoothly anyway.
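For a rough feel of why the quant level matters on a phone, here is a minimal back-of-the-envelope sketch. It only estimates the size of the weights from parameter count and bits per weight. The function name and the bit widths are my own illustrative assumptions, not exact figures for any specific quant type, and real memory use is higher once the context cache and runtime overhead are added. It does not explain the Phi crashes either.

```python
def estimate_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10**9 bytes).

    Hypothetical helper for illustration: ignores the context cache,
    runtime overhead and any per-file metadata.
    """
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Round, illustrative bit widths (assumptions, not exact quant-format values).
for bits in (2.0, 4.0, 8.0, 16.0):
    size = estimate_weights_gb(4.0, bits)  # nominal 4B-parameter model
    print(f"{bits:>4.0f} bits/weight -> about {size:.1f} GB of weights")
```

The takeaway is simply that halving the bits per weight roughly halves the weight size, which is why the lowest quants are usually the first thing to try on a memory-constrained device.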

On the vibe-check prompts, Qwen was the only one of the three that actually got the strawberry right: it listed each letter with a running count and got to three. The marble question again got the same gravity-defying logic that everyone else used. Where it got weird was Sally's sisters. Qwen started with three, talked itself down to two, and then second-guessed itself with "wait, let me reassess" mid-answer before going back to list the family from scratch. I feel like it's double-checking itself, which isn't a bad thing.
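The letter-by-letter approach is easy to reproduce if you want a ground truth to check a model's strawberry answer against. This is a minimal sketch of the same idea, walking the word one character at a time and keeping a running tally. The function name is my own, and this is not the model's actual output.

```python
def count_letter(word: str, letter: str) -> int:
    """Count how often `letter` appears in `word`, printing a running tally."""
    target = letter.lower()
    count = 0
    for position, char in enumerate(word.lower(), start=1):
        if char == target:
            count += 1
        print(f"{position:>2}. {char}  running count: {count}")
    return count


total = count_letter("strawberry", "r")
print(f"Total: {total}")
```

Working through the characters individually sidesteps whatever the model does with the word as a whole, which is presumably why spelling it out helps.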

The course was the strongest of the three. Qwen named specific editors that beginners could actually open (VS Code, PyCharm, Replit) rather than hand-waving at a "text editor," and the final project was a tip calculator with real-world variables instead of a greeting loop. It's probably the one I'd download for offline study.

The one I would actually keep

My final pick would be Qwen. It far surpassed the others in reasoning and structure. Gemma stays loaded because I prefer it for visual work, thanks to its superior image analysis. To be honest, whenever I try other models I always end up back with the Gemma and Qwen families, so this wasn't too surprising.

If you want to try it out for yourself, the runner I use is PocketPal, available on Android and iOS, and it integrates directly with the Hugging Face hub, so you have plenty of models to choose from.
