FAUN.dev's AI/ML Weekly Newsletter
 
 
 
AILinks
 
This week in Generative AI/ML, with Kala the Koala
 
 
📝 A Few Words
 
 
I wrote a book so you can stop renting your AI

Last August, OpenAI retired GPT-4o overnight and moved everyone to GPT-5. People who had built their daily work around a fast, predictable model woke up to a slower one that behaved differently, with no way back. This June, a US export control directive forced Anthropic to cut Fable 5 and Mythos 5 for every customer at once. Teams lost the models they were building on in an afternoon, for reasons that had nothing to do with their own work.

None of those people did anything wrong. They just did not own what they ran. The model, the price, and the rules sat in someone else's hands, and any of the three could move without warning.

My new book exists because I got tired of that arrangement.

Local AI Engineering with Ollama: Run, understand, customize, fine-tune, and build agentic apps on your own hardware.

This is a practical book, not a survey of the field, with no history lessons or predictions about where AI is headed. You install the runtime, pull a model, and by the end you have three things sitting on your own disk:

  • A custom model packaged with a Modelfile that does one job the same way every time, and that a teammate can pull and run with zero setup.
  • A fine-tuned model trained with QLoRA and Unsloth, then exported to GGUF and run in Ollama.
  • A chat application you build in nine passes until it becomes an advanced agent, with conversation history, streaming, context trimming, LangChain summarization, Redis caching, mem0 long-term memory, function calling, and tools served over MCP.
  • And more!

Along the way you learn what a model is actually doing (tokens, weights, embeddings, the KV cache, quantization), how to size a model against your RAM or VRAM before you download it, how to drive Ollama from its HTTP API, and how to control the context window so the model stops silently forgetting where a long chat started.
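As a rough illustration of sizing a model before downloading it, here is a minimal back-of-the-envelope sketch. It is not taken from the book: the function names and the 20% overhead figure (a stand-in for the KV cache and runtime buffers) are assumptions for illustration, and real quantized files often use slightly more bits per weight than their nominal label suggests.

```python
def estimate_model_memory_gb(params_billions: float,
                             bits_per_weight: float,
                             overhead_fraction: float = 0.2) -> float:
    """Rough memory needed to load a model, in GB (10^9 bytes).

    overhead_fraction is an assumed allowance for the KV cache and
    runtime buffers; it is a guess, not a measured value.
    """
    weight_bytes = params_billions * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead_fraction) / 1e9


def fits(params_billions: float, bits_per_weight: float,
         available_gb: float) -> bool:
    """True if the rough estimate fits in the given RAM or VRAM budget."""
    return estimate_model_memory_gb(params_billions, bits_per_weight) <= available_gb


if __name__ == "__main__":
    # Compare a 7B-parameter model at 4-bit and 16-bit precision on an 8 GB budget.
    for bits in (4, 16):
        need = estimate_model_memory_gb(7, bits)
        print(f"7B @ {bits}-bit: ~{need:.1f} GB, fits in 8 GB: {fits(7, bits, 8)}")
```

The point of the estimate is only to rule out models that clearly will not fit; longer context windows grow the KV cache and push the real number higher.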

The stack you practice on: Ollama, Unsloth, LangChain, Redis, Docker, mem0, and Open WebUI.

Who it's for

If you can run a command and edit a file, you are qualified. No ML degree required, and none wanted. This is the book I needed when I started, written for the developer in the middle, past the marketing pages and short of the research papers.

What makes it different

Every command in the book was run on a real machine. Every output you see (the JSON responses, the error messages, the token counts, the training logs) came from an actual session, not from docs I trusted and pasted in. When Ollama behaved differently from its own documentation, I say so and pin the version it happened on. Where accuracy and polish pulled apart, accuracy won. That is the part that ages well.

28 modules, 91 sections, lifetime access and updates, a built-in AI assistant (SenseiOne) for your questions, and a 30-day money-back guarantee.

Get your copy

👉 On FAUN.sensei: Local AI Engineering with Ollama. Use code OLLAMA20 at checkout for 20% off. The code expires July 8, 2026 at 11:59 PM, so move before then.

👉 On Amazon.com: paperback and Kindle editions are live (also available on other marketplaces such as .fr and .de).

👑 If you want to stop renting and start owning, this is for you. Get a model running tonight, and keep going.

The rest follows from there.
 
 
🔍 Inside this Issue
 
 
AI agents are having a rough week: real-world exploits, sketchy supply chains, and a reminder that clever demos turn into incident tickets fast. On the brighter side, there are a couple of solid pieces here that make the models feel less magical and more engineerable.

🚨 7,000 Langflow servers are under attack. LangGraph and LangChain have the same holes
🧬 OpenClaw’s Skill Marketplace and the Emerging AI Supply Chain Threat
🧪 Don't let the LLM speak, just probe it
🧠 How LLMs Actually Work
🧵 Introducing Claude Tag
⚙️ Build real agentic apps using CUGA: two dozen working examples on a lightweight harness
🔥 Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM

Ship smarter than the hype cycle.

Thanks for reading!
FAUN.dev() Team
 
 
⭐ Patrons
 
iacconf.com
 
Turn Terraform modules into self-service building blocks for humans and AI agents.
 
 
Terraform modules are often designed around what they do, not how easily humans or AI agents can use and reuse them. Join Jinger Meilani of MNTN to learn how to design IaC interfaces for humans, AI agents, and whatever comes next. Leave with concrete patterns that reduce misuse and help non-infrastructure developers get up to speed faster.

Register for free. July 14 | 12 PM EDT
 
 
👉 Spread the word and help developers find you by promoting your projects on FAUN. Get in touch for more information.
 
⭐ Sponsors
 
faun.dev
 
Your Argo CD knowledge ends where production begins
 
 
The tutorials stop at "apply this, watch it sync". Then come drift, RBAC lockouts, repo-server OOM, and secrets sitting in Git. "GitOps the Hard Way, with Argo CD" covers that half: 12 chapters from an empty cluster to a working pipeline, every command tested live.

20% off with GITOPS20 until June 30. Get your copy now, or grab the paperback on Amazon (or search for the title on your local Amazon store).
 
 
 
🔗 Stories, Tutorials & Articles
 
venturebeat.com
 
7,000 Langflow servers are under attack. LangGraph and LangChain have the same holes
 
 
Three popular AI agent frameworks had major vulnerabilities, from SQL injection to path traversal, allowing attackers to gain full remote code execution and access sensitive data. Exploits were publicly disclosed, and patches have been released for each framework.
 
 
unit42.paloaltonetworks.com
 
OpenClaw’s Skill Marketplace and the Emerging AI Supply Chain Threat
 
 
Unit 42 researchers found five malicious ClawHub skills that attackers had designed to pass the marketplace's post-incident automated checks.
 
 
anthropic.com
 
Introducing Claude Tag
 
 
Anthropic's Claude Tag beta gives Slack teams a shared agent they can tag in a channel, assign tasks to, and connect to approved tools.

Teams gain three practical benefits:

  • Claude can keep channel context, so teammates avoid re-explaining project history.
  • Admins can scope memory and tool access by channel.
  • Teams can treat Claude as a Slack collaborator with permissions and a task queue.

Admins should watch permission sprawl, stale memory, and unclear ownership when Claude acts through tools.
 
 
blog.j11y.io
 
Don't let the LLM speak, just probe it
 
 
When an LLM reads "here's some text, here's a criterion - does it satisfy it?", the answer often already exists in its hidden state before it generates a single token. So skip generation entirely: grab the hidden state at the last prompt token (~70% of the way up the model's layers), feed it to a tiny MLP, and calibrate the output. Because the training data varies the criterion, you get one frozen model that acts as any classifier you can write in English.
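To make the probing idea concrete, here is a minimal sketch under loud assumptions. It swaps the post's tiny MLP for a plain linear probe (logistic regression) to stay dependency-free, it skips the calibration step, and the "hidden states" are synthetic vectors generated by a hypothetical helper rather than activations extracted from a real LLM. It only shows the shape of the technique: fixed vectors in, a small trained classifier out, no text generation.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_states(n, dim=8, seed=0):
    """Hypothetical stand-in for hidden states read from a frozen model.

    The label is encoded as a shift along dimension 0, plus Gaussian noise.
    """
    rng = random.Random(seed)
    states, labels = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        state = [rng.gauss(0.0, 0.5) for _ in range(dim)]
        state[0] += 1.5 if label == 1 else -1.5
        states.append(state)
        labels.append(label)
    return states, labels


def train_linear_probe(states, labels, epochs=50, lr=0.1):
    """Fit logistic regression with plain SGD; the 'model' stays frozen."""
    w = [0.0] * len(states[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(states, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def probe_accuracy(w, b, states, labels):
    hits = 0
    for x, y in zip(states, labels):
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        hits += int((p >= 0.5) == (y == 1))
    return hits / len(labels)


if __name__ == "__main__":
    train_x, train_y = make_synthetic_states(200, seed=0)
    test_x, test_y = make_synthetic_states(100, seed=1)
    w, b = train_linear_probe(train_x, train_y)
    print("held-out accuracy:", probe_accuracy(w, b, test_x, test_y))
```

With a real model, the synthetic helper would be replaced by whatever code reads the hidden state at the last prompt token; that part is deliberately left out here.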
 
 
0xkato.xyz
 
How LLMs Actually Work   ✅
 
 
This post covers the core mechanisms inside modern transformer-based LLMs, including tokens, embeddings, positional encoding, attention, multi-head attention, and more. Tokenization converts text into integer IDs, embeddings give tokens meaning through vectors, and positional encoding helps the model understand the order of tokens. Attention allows tokens to share information with each other, and multi-head attention tracks different relationships simultaneously.
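The pipeline in that summary can be sketched in a few lines. This is a toy, not the linked post's code: the three-word vocabulary and the 2-dimensional embedding table are made up, the tokenizer just splits on whitespace, and the attention step uses the embeddings directly as queries, keys, and values instead of learned projections. The positional encoding follows the sinusoidal scheme from the original Transformer paper.

```python
import math

# Made-up vocabulary and embedding table, for illustration only.
VOCAB = {"the": 0, "koala": 1, "sleeps": 2}
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def tokenize(text):
    """Text -> integer IDs (whitespace split; real tokenizers use subwords)."""
    return [VOCAB[word] for word in text.lower().split()]


def positional_encoding(position, dim):
    """Sinusoidal encoding: sin on even indices, cos on odd indices."""
    enc = []
    for j in range(dim):
        exponent = (j - j % 2) / dim
        angle = position / (10000 ** exponent)
        enc.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
    return enc


def embed(token_ids):
    """Token embedding plus positional encoding for each position."""
    dim = len(EMBEDDINGS[0])
    return [
        [e + p for e, p in zip(EMBEDDINGS[tid], positional_encoding(pos, dim))]
        for pos, tid in enumerate(token_ids)
    ]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


if __name__ == "__main__":
    x = embed(tokenize("The koala sleeps"))
    for row in attention(x, x, x):  # self-attention without learned projections
        print(row)
```

Multi-head attention would run several such attention steps in parallel on different learned projections of the same input and concatenate the results.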
 
 
huggingface.co
 
Build real agentic apps using CUGA: two dozen working examples on a lightweight harness   ✅
 
 
CUGA, the Agent Harness for the Enterprise from IBM, streamlines agent building by handling planning, the execution loop, tool calls, and state plumbing. You focus on defining tools and prompts while the harness takes care of the rest, so you can build agents without learning a new framework.
 
 

👉 Got something to share? Create your FAUN Page and start publishing your blog posts, tools, and updates. Grow your audience, and get discovered by the developer community.

 
🎦 Videos, Talks & Presentations
 
youtube.com
 
Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM
 
 
Every non-technical executive thought 2024/2025 would be the year they replaced software engineers with large language models. The reality check has arrived as AI replacement projects hit a wall. Senior developers saw this collapse coming from a mile away.
 
 
 
⚙️ Tools, Apps & Software
 
github.com
 
jmaczan/tiny-vllm
 
 
Build your own high performance LLM inference engine in C++ and CUDA - a smaller version of vLLM
 
 
github.com
 
maziyarpanahi/openmed
 
 
Local-first healthcare AI: clinical NER & HIPAA PII de-identification that runs 100% on-device. 1,000+ medical models, 12 languages, Apple MLX + Python, no cloud, no patient data leaving your network. Apache-2.0
 
 
github.com
 
RyanCodrai/turbovec
 
 
A vector index built on TurboQuant, written in Rust with Python bindings
 
 
github.com
 
tastyeffectco/sandboxd
 
 
Self-hosted dev sandboxes with preview URLs. One command, no Kubernetes. Perfect for coding agents and SaaS factories.
 
 
github.com
 
BerriAI/litellm
 
 
Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, load balancing, and logging. [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthropic, Sagemaker, HuggingFace, VLLM, NVIDIA NIM]
 
 

👉 Spread the word and help developers find and follow your Open Source project by promoting it on FAUN. Get in touch for more information.

 
🤔 Did you know?
 
 
Did you know that large language models can generate text faster using speculative decoding, where a small fast "draft" model guesses several tokens ahead and a larger model checks them all in a single pass? Tokens the big model agrees with are kept and the rest are recomputed, so the final text is exactly what the big model would have produced on its own, just with fewer slow sequential steps.
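The accept-and-recompute loop can be sketched with two toy "models". Everything here is an assumption for illustration: tokens are small integers, `target_next` and `draft_next` are made-up deterministic functions, decoding is greedy, and the verification that a real system does in one batched forward pass is simulated with a plain loop. Sampling-based variants use a rejection step instead of an exact match.

```python
def target_next(context):
    """Toy 'large' model: deterministic next token from the last token."""
    return (context[-1] * 3 + 1) % 10


def draft_next(context):
    """Toy 'draft' model: agrees with the target except after token 4."""
    if context[-1] == 4:
        return 0  # deliberately wrong guess
    return (context[-1] * 3 + 1) % 10


def greedy_decode(next_token, prompt, n_new):
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_token(out))
    return out


def speculative_decode(target, draft, prompt, n_new, k=4):
    """Greedy speculative decoding: draft k tokens, keep the verified prefix."""
    out = list(prompt)
    goal = len(prompt) + n_new
    while len(out) < goal:
        # 1. The draft model guesses k tokens ahead.
        guesses, ctx = [], list(out)
        for _ in range(k):
            token = draft(ctx)
            guesses.append(token)
            ctx.append(token)
        # 2. The target checks each guess against its own choice.
        accepted = []
        for token in guesses:
            expected = target(out + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # replace the first wrong guess
                break
        else:
            # Every guess matched: the target contributes one extra token.
            accepted.append(target(out + accepted))
        out.extend(accepted)
    return out[:goal]


if __name__ == "__main__":
    baseline = greedy_decode(target_next, [1], 10)
    fast = speculative_decode(target_next, draft_next, [1], 10)
    print(baseline)
    print(fast == baseline)
```

Because every kept token is one the target would have chosen for that prefix, the result matches plain greedy decoding with the target; the saving in a real system comes from verifying several positions per target pass.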
 
 
🤖 Once, SenseiOne Said
 
 
"Most ML bugs are fixed by a new dataset, then resurrected by the next one. MLOps is admitting your model is correct only until reality deploys a different input distribution. That's why the hardest part of AI is versioning what you didn't know you were assuming."

SenseiOne
 

(*) SenseiOne is FAUN.dev’s work-in-progress AI agent

 
 
 
 
 
❤️ Thanks for reading
 
 
👋 Keep in touch and follow us on social media:
- 💼LinkedIn
- 📝Medium
- 🐦Twitter
- 👥Facebook
- 📰Reddit
- 📸Instagram

👌 Was this newsletter helpful?
We'd really appreciate it if you could forward it to your friends!

🙏 Never miss an issue!
To receive our future emails in your inbox, don't forget to add community@faun.dev to your contacts.

🤩 Want to sponsor our newsletter?
Reach out to us at sponsors@faun.dev and we'll get back to you as soon as possible.
 

AILinks #534: How LLMs Actually Work
Legend: ✅ = Editor's Choice / ♻️ = Old but Gold / ⭐ = Promoted / 🔰 = Beginner Friendly

You received this email because you are subscribed to FAUN.dev.
We (🐾) help developers (👣) learn and grow by keeping them up to date with what matters.

You can manage your subscription options here (recommended) or use the old way here (legacy). If you have any problem, read this or reply to this email.