What exactly does ollama run do at startup
Below is the log produced after starting ollama serve and then running ollama run qwq:32b.
[GIN] 2025/03/09 - 08:04:36 | 200 | 0s | 192.168.11.100 | HEAD "/"
[GIN] 2025/03/09 - 08:04:36 | 200 | 12.6348ms | 192.168.11.100 | POST "/api/show"
time=2025-03-09T08:04:36.409+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-09T08:04:36.409+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-09T08:04:36.409+08:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=D:\Software\AI\ollama\model\blobs\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb gpu=GPU-213bb6a8-2adc-9591-a814-ae650a18a4f2 parallel=4 available=23918694400 required="21.5 GiB"
time=2025-03-09T08:04:36.421+08:00 level=INFO source=server.go:97 msg="system memory" total="31.8 GiB" free="18.7 GiB" free_swap="22.0 GiB"
time=2025-03-09T08:04:36.421+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-09T08:04:36.421+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-09T08:04:36.422+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[22.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.5 GiB" memory.required.partial="21.5 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="19.5 GiB" memory.weights.repeating="18.9 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB"
time=2025-03-09T08:04:36.424+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="D:\\Software\\AI\\ollama\\ollama.exe runner --model D:\\Software\\AI\\ollama\\model\\blobs\\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb --ctx-size 8192 --batch-size 512 --n-gpu-layers 65 --threads 8 --no-mmap --parallel 4 --port 52036"
time=2025-03-09T08:04:36.426+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-09T08:04:36.426+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-09T08:04:36.427+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-09T08:04:36.443+08:00 level=INFO source=runner.go:932 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from D:\Software\AI\ollama\lib\ollama\cuda_v12\ggml-cuda.dll
load_backend: loaded CPU backend from D:\Software\AI\ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-03-09T08:04:36.596+08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | cgo(clang)" threads=8
time=2025-03-09T08:04:36.596+08:00 level=INFO source=runner.go:993 msg="Server listening on 127.0.0.1:52036"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23008 MiB free
time=2025-03-09T08:04:36.677+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from D:\Software\AI\ollama\model\blobs\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = QwQ 32B
llama_model_loader: - kv 3: general.basename str = QwQ
llama_model_loader: - kv 4: general.size_label str = 32B
llama_model_loader: - kv 5: general.license str = apache-2.0
llama_model_loader: - kv 6: general.license.link str = https://huggingface.co/Qwen/QWQ-32B/b...
llama_model_loader: - kv 7: general.base_model.count u32 = 1
llama_model_loader: - kv 8: general.base_model.0.name str = Qwen2.5 32B
llama_model_loader: - kv 9: general.base_model.0.organization str = Qwen
llama_model_loader: - kv 10: general.base_model.0.repo_url str = https://huggingface.co/Qwen/Qwen2.5-32B
llama_model_loader: - kv 11: general.tags arr[str,2] = ["chat", "text-generation"]
llama_model_loader: - kv 12: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 13: qwen2.block_count u32 = 64
llama_model_loader: - kv 14: qwen2.context_length u32 = 131072
llama_model_loader: - kv 15: qwen2.embedding_length u32 = 5120
llama_model_loader: - kv 16: qwen2.feed_forward_length u32 = 27648
llama_model_loader: - kv 17: qwen2.attention.head_count u32 = 40
llama_model_loader: - kv 18: qwen2.attention.head_count_kv u32 = 8
llama_model_loader: - kv 19: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 20: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 21: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 22: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 23: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 24: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 25: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 26: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 27: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 28: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 29: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 30: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 31: general.quantization_version u32 = 2
llama_model_loader: - kv 32: general.file_type u32 = 15
llama_model_loader: - type f32: 321 tensors
llama_model_loader: - type q4_K: 385 tensors
llama_model_loader: - type q6_K: 65 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen2
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 152064
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 131072
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 64
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 5
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 27648
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 131072
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: ssm_dt_b_c_rms = 0
llm_load_print_meta: model type = 32B
llm_load_print_meta: model ftype = Q4_K - Medium
llm_load_print_meta: model params = 32.76 B
llm_load_print_meta: model size = 18.48 GiB (4.85 BPW)
llm_load_print_meta: general.name = QwQ 32B
llm_load_print_meta: BOS token = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151643 '<|endoftext|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token = 151645 '<|im_end|>'
llm_load_print_meta: EOG token = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 64 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 65/65 layers to GPU
llm_load_tensors: CPU model buffer size = 417.66 MiB
llm_load_tensors: CUDA0 model buffer size = 18508.35 MiB
llama_new_context_with_model: n_seq_max = 4
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
llama_kv_cache_init: CUDA0 KV buffer size = 2048.00 MiB
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 2.40 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 696.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 26.01 MiB
llama_new_context_with_model: graph nodes = 2246
llama_new_context_with_model: graph splits = 2
time=2025-03-09T08:04:43.186+08:00 level=INFO source=server.go:596 msg="llama runner started in 6.76 seconds"
Below is the Ollama startup flow reconstructed from the log, broken down stage by stage as hardware detection → resource allocation → model loading → service startup:
0.1 Flow Diagram
[User types the command]
│
▼
[1. Hardware detection]
├── Check the GPU model (RTX 4090) and VRAM (23 GB)
├── Check CPU core count and system memory (31.8 GB total, 18.7 GB free)
└── Check the CUDA environment (compute capability 8.9 supported)
│
▼
[2. Resource allocation]
├── VRAM allocation strategy:
│ ├── Total model requirement: 21.5 GB
│ ├── Offload all 65 layers to the GPU (including the attention layers and the output layer)
│ └── Reserve a 417 MB CPU buffer
└── Memory management:
├── System memory: 18.7 GB free
└── Swap space: 22 GB free
│
▼
[3. Model loading]
├── Verify the model file (SHA-256 blob)
├── Load the GGUF model:
│ ├── Metadata (33 key/value pairs)
│ ├── Tensor parameters (771 tensors, Q4_K and Q6_K quantization types)
│ └── Tokenizer configuration (BPE, 152k vocabulary)
├── Offload layers to the GPU:
│ ├── 64 repeating layers
│ └── 1 output layer (non-repeating)
└── Initialize the KV cache (2 GB of VRAM)
│
▼
[4. Service startup]
├── Start the llama server:
│ ├── Bind the port (127.0.0.1:52036)
│ ├── Set context parameters (n_ctx=8192, batch-size=512)
│ └── Enable CUDA acceleration via cuBLAS (Flash Attention stays off)
├── Warm up the model:
│ ├── Initialize CUDA graphs (graph splits = 2)
│ └── Test inference responsiveness
└── Listen for API requests (/api/show and other endpoints)
│
▼
[Startup complete: waiting for user input]
0.2 Key Steps in Detail
0.2.1 Hardware Detection
- Log evidence:
  ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
  time=2025-03-09T08:04:36.421+08:00 level=INFO source=server.go:97 msg="system memory" total="31.8 GiB" free="18.7 GiB"
- Details:
  - An RTX 4090 (23 GB of VRAM) and system memory (31.8 GB total, 18.7 GB free) are detected.
  - CUDA support is confirmed (compute capability 8.9, i.e. the Ada Lovelace architecture).
0.2.2 Resource Allocation
- Log evidence:
  time=2025-03-09T08:04:36.422+08:00 level=INFO source=server.go:130 msg=offload ... memory.required.allocations="[21.5 GiB]"
- Details:
  - VRAM allocation:
    - Total model requirement is 21.5 GB, of which roughly 18.5 GB is weight parameters and 2 GB is the KV cache.
    - A 417 MB CPU-side model buffer is kept for the tensors that remain in host memory.
  - Memory strategy:
    - GPU offload takes priority; anything that does not fit would remain in CPU memory (a minimal sketch of the scheduler's fit check follows this list).
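The scheduler message from sched.go:715 ("new model will fit in available VRAM in single GPU, loading") boils down to comparing the estimated requirement with free VRAM. Here is a minimal sketch of that comparison using the numbers from this log; the type and field names are illustrative, not Ollama's actual structs:

```go
package main

import "fmt"

// gpuFitEstimate is a deliberately simplified, hypothetical stand-in for the
// check behind sched.go's "new model will fit in available VRAM in single GPU".
type gpuFitEstimate struct {
	availableBytes uint64 // free VRAM reported for the GPU
	requiredBytes  uint64 // weights + KV cache + compute graph
}

func (e gpuFitEstimate) fitsSingleGPU() bool {
	return e.requiredBytes <= e.availableBytes
}

func main() {
	const GiB = 1 << 30
	est := gpuFitEstimate{
		availableBytes: 23918694400,        // available=23918694400 (~22.3 GiB) from the log
		requiredBytes:  uint64(21.5 * GiB), // required="21.5 GiB"
	}
	fmt.Printf("fits on a single GPU: %v\n", est.fitsSingleGPU()) // true
}
```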
0.2.3 Model Loading
- Log evidence:
  llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors ... (version GGUF V3)
  llm_load_tensors: offloaded 65/65 layers to GPU
- Details:
  - Metadata: model architecture (Qwen2), tokenizer configuration (BPE), training context length (131k).
  - Tensor loading:
    - 321 FP32 tensors (kept at full precision).
    - 385 Q4_K quantized tensors (high compression).
    - 65 Q6_K quantized tensors (balancing precision and speed).
  - KV cache: 2 GB of VRAM is allocated to store attention key/value pairs (see the arithmetic below).
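The 2 GiB KV cache follows directly from the hyperparameters printed later in the log: K and V tensors × 64 layers × 8192 cached positions × 1024 dimensions (n_embd_k_gqa) × 2 bytes for f16. A quick arithmetic check (plain arithmetic, not Ollama code):

```go
package main

import "fmt"

func main() {
	// Values taken from the llm_load_print_meta / llama_kv_cache_init lines above.
	const (
		nLayer       = 64   // n_layer
		kvSize       = 8192 // kv_size (cached context positions)
		nEmbdKGQA    = 1024 // n_embd_k_gqa = n_embd_head_k * n_head_kv = 128 * 8
		bytesPerElem = 2    // f16
	)
	kvBytes := int64(2) * nLayer * kvSize * nEmbdKGQA * bytesPerElem // K and V
	fmt.Printf("KV cache: %d MiB\n", kvBytes/(1<<20))                // prints 2048, matching the log
}
```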
0.2.4 Service Startup
- Log evidence:
  time=2025-03-09T08:04:36.424+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="..."
  time=2025-03-09T08:04:43.186+08:00 level=INFO source=server.go:596 msg="llama runner started in 6.76 seconds"
- Details:
  - Service configuration:
    - Maximum context length of 8192 tokens (with parallel=4 this works out to 2048 tokens per sequence).
    - Batch size of 512 tokens (for throughput).
  - Acceleration strategy:
    - CUDA graphs are enabled (reducing kernel-launch overhead).
    - Flash Attention is not enabled (flash_attn = 0; Ollama leaves it off unless explicitly switched on).
  - Startup time: about 6.76 seconds, including warm-up. Readiness can be checked from outside as sketched below.
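The very first log line (HEAD "/" answered with 200) is the CLI checking that ollama serve is reachable before doing anything else. A minimal sketch of the same readiness probe, assuming the default address 127.0.0.1:11434 (adjust if OLLAMA_HOST is set):

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// Poll the Ollama HTTP server until it answers HEAD / with 200, the same
// request the CLI issues first (see the `HEAD "/"` line in the log).
func waitForOllama(baseURL string, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		resp, err := http.Head(baseURL + "/")
		if err == nil {
			resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				return nil
			}
		}
		time.Sleep(200 * time.Millisecond)
	}
	return fmt.Errorf("ollama did not become ready within %s", timeout)
}

func main() {
	if err := waitForOllama("http://127.0.0.1:11434", 30*time.Second); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("Ollama is up")
}
```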
0.3 The Startup Flow in a Nutshell
- Hardware detection: first check whether the machine's GPU and memory are up to the job.
- Resource allocation: pack most of the model into the GPU and leave a small part in system memory.
- Model loading: read the model file, process the parameters and tokenizer, and allocate VRAM.
- Service startup: spin up a local server, apply the parameters, and wait for the user's questions.
The whole process is like building with Lego: first check the pieces (hardware), then plan the build (resource allocation), then assemble it (model loading), and finally plug it in (service startup)!
1 In Plain Terms: What Happens After You Type ollama run?
When you type ollama run qwq:32b in a terminal, it is like pressing a rocket's launch button. The system works like building with Lego: it first checks your machine's configuration (is the GPU big enough? is there enough memory?), then splits the model file into "building blocks" the GPU can process, and finally assembles them into an AI engine that can answer questions. The whole process covers four core stages: hardware detection, resource allocation, model loading, and service startup.
2 Hardware Environment Detection
2.1 GPU Capability Assessment
The system first scans for CUDA devices, detects an NVIDIA RTX 4090 (23 GB of VRAM) [[10]], and verifies that its compute capability (8.9) supports the model's operations. This step determines whether GPU acceleration can be enabled; otherwise Ollama falls back to CPU mode.
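Ollama performs this detection through its bundled CUDA backend (the ggml_cuda_init lines above). Roughly the same facts can be read externally from nvidia-smi on recent drivers; the sketch below only illustrates that and is not Ollama's own code:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// Rough external equivalent of the detection step: list GPU name, free VRAM and
// compute capability via nvidia-smi instead of Ollama's internal CUDA bindings.
func main() {
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=name,memory.free,compute_cap",
		"--format=csv,noheader").Output()
	if err != nil {
		fmt.Println("no usable NVIDIA GPU found; Ollama would fall back to CPU:", err)
		return
	}
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		fmt.Println("GPU:", line) // e.g. "NVIDIA GeForce RTX 4090, 23008 MiB, 8.9"
	}
}
```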
2.2 Memory Resource Audit
Via system calls it gathers:
- Physical memory: 31.8 GB (18.7 GB free)
- Virtual memory: 22 GB of free swap space
- GPU VRAM: 23 GB, of which roughly 22.3 GB is reported available against a 21.5 GB requirement
These figures directly shape the loading strategy, for example whether partial offloading to system memory is needed [[3]].
3 Resource Allocation Strategy
3.1 VRAM Allocation Table
| Resource | Allocation | Purpose |
|---|---|---|
| Main VRAM buffer | 18.5 GB | Stores the model weights |
| KV cache | 2 GB | Holds intermediate attention key/value results |
| CUDA compute buffer | 696 MB | Scratch space for computation |
| Reserved CPU buffer | 417 MB | Host-side buffer / fallback for memory overflow |
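The 21.5 GiB the scheduler reserved can be roughly reproduced from the msg=offload log line: repeating weights plus KV cache plus compute graph. A back-of-the-envelope check, plain arithmetic rather than Ollama's real estimator:

```go
package main

import "fmt"

func main() {
	// Figures taken from the "msg=offload" log line near the top of this post.
	const GiB = float64(1 << 30)
	const MiB = float64(1 << 20)

	weightsRepeating := 18.9 * GiB // memory.weights.repeating (the 64 layers put on the GPU)
	kvCache := 2.0 * GiB           // memory.required.kv
	graphFull := 676.0 * MiB       // memory.graph.full (compute buffer)

	total := weightsRepeating + kvCache + graphFull
	// Prints roughly 21.6 GiB; the scheduler reserved required="21.5 GiB".
	fmt.Printf("estimated GPU need: %.1f GiB\n", total/GiB)
}
```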
3.2 Layer-by-Layer Offloading
The system uses a layered offload strategy and loads all 65 model layers onto the GPU (an illustrative sketch of the fitting logic follows this list):
- The 64 repeating layers (Transformer blocks) claim VRAM first
- The single output layer occupies a reserved hot region of VRAM
- The quantized parameters (Q4_K/Q6_K) determine the resulting memory layout [[5]]
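How many layers end up on the GPU is, in essence, "free VRAM minus fixed costs, divided by per-layer size". The sketch below mirrors only that idea with numbers from this log; Ollama's real estimator accounts for far more detail, and the function here is hypothetical:

```go
package main

import "fmt"

// Hypothetical, simplified version of the offload decision: how many repeating
// layers fit on the GPU once the fixed costs (KV cache, compute graph,
// non-repeating weights) are reserved. Ollama's real estimator is more detailed.
func layersThatFit(freeVRAM, perLayer, fixedCost uint64, totalLayers int) int {
	if freeVRAM <= fixedCost {
		return 0
	}
	n := int((freeVRAM - fixedCost) / perLayer)
	if n > totalLayers {
		n = totalLayers
	}
	return n
}

func main() {
	const MiB = uint64(1 << 20)
	free := 23008 * MiB               // "23008 MiB free" from the log
	perLayer := 18.9 * 1024 / 64      // ≈ 302 MiB per repeating layer (18.9 GiB / 64)
	fixed := (2048 + 676 + 609) * MiB // KV cache + compute graph + non-repeating weights

	// Prints 64; the output layer is offloaded on top of that, giving 65/65.
	fmt.Println("repeating layers offloaded:", layersThatFit(free, uint64(perLayer)*MiB, fixed, 64))
}
```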
4 The Model-Loading Process
4.1 Parsing the GGUF Format
From the sha256-c62ccde563… model blob the loader extracts:
- 33 metadata key/value pairs (including the model architecture and tokenization rules)
- 771 tensors (321 FP32 + 385 Q4_K + 65 Q6_K)
- A BPE tokenizer with a 152k vocabulary [[7]]
graph TD
    A[GGUF file] --> B[Metadata parsing]
    A --> C[Tensor loading]
    B --> D{Validate model parameters}
    C --> E[VRAM mapping]
    D -->|validation passes| E
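All of this hangs off a small fixed header at the start of the GGUF file. Below is a minimal sketch that reads only that header (magic, version, tensor count, key/value count), assuming the little-endian GGUF v2/v3 layout; the full metadata and tensor walk performed by llama.cpp is omitted:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// Minimal sketch of the fixed GGUF header (v2/v3, little-endian): 4-byte magic
// "GGUF", uint32 version, uint64 tensor count, uint64 metadata key/value count.
// The loader then walks the 33 key/value pairs and 771 tensor descriptors
// reported in the log; that part is omitted here.
func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: ggufhead <path-to-gguf-blob>")
		return
	}
	f, err := os.Open(os.Args[1]) // e.g. the sha256-c62ccde563... blob
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var magic [4]byte
	if _, err := io.ReadFull(f, magic[:]); err != nil {
		panic(err)
	}
	var version uint32
	var tensorCount, kvCount uint64
	// Error handling elided for brevity.
	binary.Read(f, binary.LittleEndian, &version)
	binary.Read(f, binary.LittleEndian, &tensorCount)
	binary.Read(f, binary.LittleEndian, &kvCount)

	fmt.Printf("magic=%s version=%d tensors=%d kv_pairs=%d\n",
		magic[:], version, tensorCount, kvCount)
}
```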
4.2 Context Initialization
Key parameter configuration:
{
  "n_ctx": 8192,        // maximum context length
  "n_batch": 512,       // batch size
  "n_gpu_layers": 65,   // layers offloaded to the GPU
  "flash_attn": false   // Flash Attention not enabled
}
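These runner flags come from the request options of the public API: num_ctx, num_batch and num_gpu are the documented option names, and the --ctx-size of 8192 appears to be the 2048-token default multiplied by the 4 parallel slots (hence n_ctx_per_seq = 2048 in the log). A sketch of setting those options on a single request, assuming the default 127.0.0.1:11434 address:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Sketch of setting the corresponding request options on a single /api/generate
// call. The values below mirror the defaults that produced the runner flags in the log.
func main() {
	body, _ := json.Marshal(map[string]any{
		"model":  "qwq:32b",
		"prompt": "Why is the sky blue?",
		"stream": false,
		"options": map[string]any{
			"num_ctx":   2048, // per-request context window
			"num_batch": 512,  // maps to --batch-size
			"num_gpu":   65,   // maps to --n-gpu-layers
		},
	})
	resp, err := http.Post("http://127.0.0.1:11434/api/generate",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out map[string]any
	json.NewDecoder(resp.Body).Decode(&out)
	fmt.Println(out["response"])
}
```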
5 The Service-Startup Phase
5.1 Binding the Network Service
- Opens an HTTP port for the runner (52036 in this log)
- Serves the API endpoints: /api/show returns model metadata (the call visible at the top of the log; reproduced below), while inference requests go to /api/generate and /api/chat
- Enables CUDA graph optimization (2 graph splits) to improve throughput [[8]]
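For reference, a minimal client that reproduces the POST /api/show call from the first lines of the log (recent Ollama versions accept {"model": ...}; older ones used {"name": ...}):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

// Reproduce the POST /api/show seen at the top of the log: the CLI calls it
// before loading so it can print model details and obtain the chat template.
func main() {
	body := bytes.NewBufferString(`{"model": "qwq:32b"}`)
	resp, err := http.Post("http://127.0.0.1:11434/api/show", "application/json", body)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	info, _ := io.ReadAll(resp.Body)
	fmt.Println(string(info)) // JSON containing details, template, parameters, model_info, ...
}
```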
5.2 Warm-up and Health Checks
- Runs an empty warm-up inference to confirm the service responds (this can be reproduced by hand, see the sketch below)
- Pre-allocates the KV cache
- Monitors CUDA memory for leaks (watching fluctuations around the 676 MB compute buffer) [[9]]
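The warm-up can also be triggered by hand: per the Ollama API documentation, a generate request with no prompt just loads the model, and keep_alive controls how long it then stays resident. A small sketch against the default endpoint:

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// A generate request without a prompt asks Ollama to load the model into memory
// (and VRAM) without producing any tokens; keep_alive controls how long the
// loaded runner stays resident afterwards.
func main() {
	body := bytes.NewBufferString(`{"model": "qwq:32b", "keep_alive": "10m"}`)
	resp, err := http.Post("http://127.0.0.1:11434/api/generate", "application/json", body)
	if err != nil {
		panic(err)
	}
	resp.Body.Close()
	fmt.Println("load request finished, status:", resp.Status)
}
```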
6 Practical Notes
- Handling insufficient VRAM: when a model exceeds a single card's capacity, reduce the n_gpu_layers setting (the num_gpu option) so the remaining layers are loaded into system memory instead
- Performance tuning: a suggested rule of thumb is to set the batch size (batch-size) to about a quarter of the available VRAM budget
- Version compatibility: the GGUF v3 format requires Ollama 0.2.0 or later
With the ollama run command, developers can operate large models much the way they manage Docker containers, reducing what would otherwise be complex distributed computing to a single-machine operation. This layer of abstraction lowers the barrier to deploying a 130B-parameter model from PhD level to beginner level [[6]].