- Add Equatable conformance to TextStreamDelta.DeltaType enum - Fix switch exhaustiveness for channel cases in tests - Remove unused variable warnings in tests - Fix unreachable catch block warnings - Remove simplified architecture note from README - Update CLAUDE.md with Opus 4.1 references
9.8 KiB
9.8 KiB
GPT-OSS-120B Integration Guide
GPT-OSS-120B is OpenAI's open-source 120 billion parameter model, designed for high-quality text generation with advanced reasoning capabilities. Tachikoma provides seamless integration with this model through both Ollama and LMStudio.
Overview
GPT-OSS-120B offers:
- 120B parameters for nuanced understanding
- 128K context window for long conversations
- Chain-of-thought reasoning with multi-channel responses
- Tool calling support for function execution
- Multiple quantizations from Q4 (65GB) to FP16 (240GB)
Hardware Requirements
Minimum Requirements
- RAM: 32GB (Q4_0), 64GB (Q4_K_M)
- GPU: 8GB VRAM (partial offload)
- Storage: 70GB free space
- CPU: 8-core modern processor
Recommended Setup
- RAM: 64GB or more
- GPU: 24GB VRAM (RTX 3090/4090, M2 Max/Ultra)
- Storage: NVMe SSD with 150GB free
- CPU: Apple Silicon or recent Intel/AMD
Optimal Performance
- RAM: 128GB
- GPU: 48GB VRAM or dual GPUs
- Storage: 2TB NVMe SSD
- Platform: Apple M2 Ultra or dual RTX 4090
Installation
Via Ollama
# Method 1: Pull pre-built model
ollama pull gpt-oss-120b:q4_k_m
# Method 2: Import from GGUF
ollama create gpt-oss-120b -f ./Modelfile
Modelfile Configuration
FROM ./gpt-oss-120b-q4_k_m.gguf
# Model parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.95
PARAMETER top_k 40
PARAMETER num_ctx 32768 # Start with 32K context
PARAMETER num_gpu 999 # Use all available GPU layers
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|endoftext|>"
PARAMETER stop "<|im_end|>"
# System prompt for Harmony features
SYSTEM """
You are GPT-OSS-120B, an advanced language model with reasoning capabilities.
When solving complex problems or answering questions:
- Use <thinking> tags for your internal reasoning process
- Use <analysis> tags for breaking down complex issues
- Use <commentary> tags for meta-level observations
- Use <final> tags for your conclusive response
Always structure your responses clearly and think step-by-step.
"""
# Template for conversations
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
{{ end }}"""
Via LMStudio
-
Download the model:
- Open LMStudio
- Search for "gpt-oss-120b"
- Select quantization (Q4_K_M recommended)
- Click Download
-
Configure settings:
{ "context_length": 16384, "gpu_layers": -1, "temperature": 0.7, "top_p": 0.95, "repeat_penalty": 1.1, "batch_size": 512 }
Usage Examples
Basic Generation
import Tachikoma
// Simple generation
let response = try await generate(
"Explain the theory of relativity",
using: .gptOSS120B
)
print(response)
With Reasoning Chains
// High-effort reasoning for complex problems
let detailed = try await generateText(
model: .gptOSS120B,
messages: [
.system("You are a helpful assistant that shows your reasoning."),
.user("What would happen if we could travel faster than light?")
],
settings: GenerationSettings(
reasoningEffort: .high,
maxTokens: 4096,
temperature: 0.8
)
)
// Access different reasoning channels
if let thinking = detailed.channels[.thinking] {
print("Internal reasoning:", thinking)
}
if let analysis = detailed.channels[.analysis] {
print("Analysis:", analysis)
}
print("Final answer:", detailed.channels[.final] ?? detailed.text)
Tool Calling
@ToolKit
struct MathTools {
func calculate(expression: String) -> Double {
// Implementation
}
func plotGraph(function: String, range: ClosedRange<Double>) -> String {
// Implementation
}
}
let response = try await generateText(
model: .gptOSS120B,
messages: [.user("Calculate the area under the curve y=x² from 0 to 5")],
tools: MathTools(),
settings: GenerationSettings(reasoningEffort: .medium)
)
Streaming Responses
// Stream with buffering for better performance
for try await delta in streamText(
model: .gptOSS120B,
messages: conversation,
settings: GenerationSettings(
streamingBufferSize: 10 // Buffer 10 tokens
)
) {
switch delta.type {
case .channelStart(let channel):
print("Starting \(channel):")
case .textDelta(let text):
print(text, terminator: "")
case .channelEnd(let channel):
print("\nFinished \(channel)")
default:
break
}
}
With Caching
// Use aggressive caching for local models
let cachedProvider = ResponseCache.localModelCache.wrap(
OllamaProvider(model: "gpt-oss-120b")
)
// Repeated queries will be instant
let response1 = try await generateText(
model: .gptOSS120B,
messages: [.user("What is Swift?")],
provider: cachedProvider
)
// This will use the cache
let response2 = try await generateText(
model: .gptOSS120B,
messages: [.user("What is Swift?")],
provider: cachedProvider
)
Performance Optimization
Memory Management
// Configure memory limits
LocalModelMemoryManager.shared.configure(
maxMemoryGB: 48,
autoUnloadMinutes: 10
)
// Preload model for better first-response time
try await LocalModelLoader.preload(.gptOSS120B)
// Explicitly manage model lifecycle
try await LocalModelLoader.load(.gptOSS120B)
defer {
Task { try await LocalModelLoader.unload(.gptOSS120B) }
}
Context Window Management
// Adaptive context sizing based on available memory
let optimalContext = LocalModelOptimizer.calculateOptimalContext(
model: .gptOSS120B,
availableRAM: ProcessInfo.processInfo.physicalMemory
)
let settings = GenerationSettings(
maxContextTokens: optimalContext,
truncationStrategy: .keepRecent // Keep most recent messages
)
GPU Acceleration
// Configure GPU usage
let config = LocalModelConfig(
gpuLayers: .auto, // Automatically determine
metalAcceleration: true, // Use Metal on macOS
cudaDevices: [0, 1], // Use multiple GPUs if available
cpuThreads: 8 // Fallback CPU threads
)
Quantization Guide
| Quantization | Size | RAM Required | Quality | Speed | Use Case |
|---|---|---|---|---|---|
| Q4_0 | 65GB | 32GB | Good | Fast | General use |
| Q4_K_M | 67GB | 32GB | Better | Fast | Recommended |
| Q5_K_M | 82GB | 48GB | Very Good | Medium | Quality focus |
| Q6_K | 98GB | 64GB | Excellent | Slower | Research |
| Q8_0 | 127GB | 96GB | Near Perfect | Slow | Maximum quality |
| FP16 | 240GB | 128GB+ | Perfect | Very Slow | Development only |
Troubleshooting
Common Issues
Model won't load:
// Check available memory
let memoryStatus = try await LocalModelDiagnostics.checkMemory()
print("Available RAM: \(memoryStatus.availableGB)GB")
print("Required: \(memoryStatus.requiredGB)GB")
// Try smaller context
let settings = GenerationSettings(maxContextTokens: 4096)
Slow generation:
// Enable GPU acceleration
try await OllamaProvider.configure(
gpuLayers: 60, // Offload more layers to GPU
useMlock: true // Lock model in RAM
)
// Use streaming for better perceived performance
let stream = try await streamText(model: .gptOSS120B, ...)
Out of memory errors:
// Use automatic memory management
LocalModelMemoryManager.shared.enableAutoUnload()
// Or manually clear cache
await ResponseCache.localModelCache.clear()
Performance Metrics
Monitor model performance:
let metrics = try await LocalModelMetrics.measure(model: .gptOSS120B) {
try await generate("Test prompt", using: .gptOSS120B)
}
print("Time to first token: \(metrics.timeToFirstToken)ms")
print("Tokens per second: \(metrics.tokensPerSecond)")
print("Memory used: \(metrics.memoryUsedGB)GB")
Best Practices
- Start with smaller context: Begin with 8K-16K context and increase gradually
- Use appropriate quantization: Q4_K_M offers best quality/performance balance
- Enable caching: Local models benefit greatly from response caching
- Monitor memory: Keep 20% RAM free for system stability
- Adjust reasoning effort: Use
.lowfor simple queries to save resources - Batch similar requests: Process related queries together
- Preload for production: Load model before user requests
Advanced Configuration
Custom Ollama Models
Create specialized variants:
# High-creativity variant
cat > creative.Modelfile << EOF
FROM gpt-oss-120b:q4_k_m
PARAMETER temperature 1.2
PARAMETER top_p 0.98
PARAMETER repeat_penalty 0.9
EOF
ollama create gpt-oss-creative -f creative.Modelfile
Fine-tuning Integration
// Use fine-tuned variants
let customModel = LanguageModel.ollama(
OllamaModel(
name: "gpt-oss-120b-medical",
baseModel: "gpt-oss-120b",
adapterPath: "~/models/medical-adapter.bin"
)
)
Migration Guide
From GPT-4
// Before
let response = try await generateText(
model: .openai(.gpt4),
messages: messages,
apiKey: "sk-..."
)
// After
let response = try await generateText(
model: .gptOSS120B,
messages: messages
// No API key needed!
)
From Claude
// Before
let response = try await generateText(
model: .anthropic(.claude3),
messages: messages
)
// After - with reasoning chains
let response = try await generateText(
model: .gptOSS120B,
messages: messages,
settings: GenerationSettings(
reasoningEffort: .high // Similar to Claude's thinking
)
)
Related Documentation
- LMStudio Integration - Alternative local hosting
- OpenAI Harmony Features - Multi-channel responses
- Performance Tuning - Optimization guide