
What exactly happens when you run ollama run

Below is the log produced after starting ollama serve and then running ollama run qwq:32b.

 [GIN] 2025/03/09 - 08:04:36 | 200 |            0s |  192.168.11.100 | HEAD     "/"
[GIN] 2025/03/09 - 08:04:36 | 200 |     12.6348ms |  192.168.11.100 | POST     "/api/show"
time=2025-03-09T08:04:36.409+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-09T08:04:36.409+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-09T08:04:36.409+08:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=D:\Software\AI\ollama\model\blobs\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb gpu=GPU-213bb6a8-2adc-9591-a814-ae650a18a4f2 parallel=4 available=23918694400 required="21.5 GiB"
time=2025-03-09T08:04:36.421+08:00 level=INFO source=server.go:97 msg="system memory" total="31.8 GiB" free="18.7 GiB" free_swap="22.0 GiB"
time=2025-03-09T08:04:36.421+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-09T08:04:36.421+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-09T08:04:36.422+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[22.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.5 GiB" memory.required.partial="21.5 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="19.5 GiB" memory.weights.repeating="18.9 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB"
time=2025-03-09T08:04:36.424+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="D:\\Software\\AI\\ollama\\ollama.exe runner --model D:\\Software\\AI\\ollama\\model\\blobs\\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb --ctx-size 8192 --batch-size 512 --n-gpu-layers 65 --threads 8 --no-mmap --parallel 4 --port 52036"
time=2025-03-09T08:04:36.426+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-09T08:04:36.426+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-09T08:04:36.427+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-09T08:04:36.443+08:00 level=INFO source=runner.go:932 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from D:\Software\AI\ollama\lib\ollama\cuda_v12\ggml-cuda.dll
load_backend: loaded CPU backend from D:\Software\AI\ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-03-09T08:04:36.596+08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | cgo(clang)" threads=8
time=2025-03-09T08:04:36.596+08:00 level=INFO source=runner.go:993 msg="Server listening on 127.0.0.1:52036"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23008 MiB free
time=2025-03-09T08:04:36.677+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from D:\Software\AI\ollama\model\blobs\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = QwQ 32B
llama_model_loader: - kv   3:                           general.basename str              = QwQ
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/QWQ-32B/b...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen2.5 32B
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-32B
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                          qwen2.block_count u32              = 64
llama_model_loader: - kv  14:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv  15:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv  16:                  qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv  17:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  18:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  19:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  20:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 64
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 5
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 27648
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 32B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 32.76 B
llm_load_print_meta: model size       = 18.48 GiB (4.85 BPW)
llm_load_print_meta: general.name     = QwQ 32B
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 64 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 65/65 layers to GPU
llm_load_tensors:          CPU model buffer size =   417.66 MiB
llm_load_tensors:        CUDA0 model buffer size = 18508.35 MiB
llama_new_context_with_model: n_seq_max     = 4
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     2.40 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   696.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    26.01 MiB
llama_new_context_with_model: graph nodes  = 2246
llama_new_context_with_model: graph splits = 2
time=2025-03-09T08:04:43.186+08:00 level=INFO source=server.go:596 msg="llama runner started in 6.76 seconds"

Below is the Ollama startup flow reconstructed from this log, broken into stages: hardware detection → resource allocation → model loading → service startup.


[User enters the command]
[1. Hardware detection]
   ├── Check the GPU model (RTX 4090) and free VRAM (~23 GB)
   ├── Check CPU core count and system memory (31.8 GiB total, 18.7 GiB free)
   └── Check the CUDA environment (compute capability 8.9 supported)
[2. Resource allocation]
   ├── VRAM allocation strategy:
   │    ├── Total model requirement: 21.5 GiB
   │    ├── Offload all 65 layers to the GPU (64 transformer blocks plus the output layer)
   │    └── Keep a ~418 MiB model buffer in CPU memory
   └── Memory accounting:
        ├── System memory: 18.7 GiB free
        └── Swap: 22.0 GiB free
[3. Model loading]
   ├── Locate the model blob (addressed by its SHA-256 digest)
   ├── Load the GGUF model:
   │    ├── Metadata (33 key-value pairs)
   │    ├── Tensors (771 total, mixing Q4_K and Q6_K quantization)
   │    └── Tokenizer configuration (BPE, 152k vocabulary)
   ├── Layer-by-layer offload to the GPU:
   │    ├── 64 repeating layers (transformer blocks)
   │    └── 1 non-repeating output layer
   └── Initialize the KV cache (2 GiB of VRAM)
[4. Service startup]
   ├── Start the llama runner:
   │    ├── Bind the runner port (127.0.0.1:52036)
   │    ├── Set context parameters (n_ctx=8192, batch-size=512)
   │    └── Enable CUDA acceleration (CUDA graphs; Flash Attention stays off in this log)
   ├── Wait for the runner to become ready:
   │    ├── Build the CUDA compute graph (graph splits = 2)
   │    └── Poll until the runner responds
   └── Serve API requests (/api/show, /api/generate, ...)
Startup complete [waiting for user input]

Stage 1: Hardware detection
  • Log evidence
    ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
    time=2025-03-09T08:04:36.421+08:00 level=INFO source=server.go:97 msg="system memory" total="31.8 GiB" free="18.7 GiB"
  • Details
    • An RTX 4090 with ~23 GB of free VRAM is detected, along with 31.8 GiB of system memory (18.7 GiB free).
    • The CUDA environment is confirmed (compute capability 8.9, i.e. the Ada Lovelace architecture of the RTX 4090); a small sketch of reproducing this check follows.
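If you want to reproduce this check by hand, the sketch below queries the same facts (GPU name, total/free VRAM, compute capability) through nvidia-smi. It is only an illustration: Ollama itself probes devices through its bundled CUDA/ggml backends rather than shelling out to nvidia-smi, and the compute_cap query field needs a reasonably recent driver.

```python
# Sketch: query GPU name, memory, and compute capability with nvidia-smi.
# Assumes the NVIDIA driver utilities are on PATH; Ollama itself detects GPUs
# through its bundled CUDA/ggml backends, not via this command.
import subprocess

def query_gpus():
    fields = "name,memory.total,memory.free,compute_cap"
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={fields}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    gpus = []
    for line in out.strip().splitlines():
        name, total, free, cap = [x.strip() for x in line.split(",")]
        gpus.append({"name": name, "vram_total_mib": int(total),
                     "vram_free_mib": int(free), "compute_cap": cap})
    return gpus

if __name__ == "__main__":
    for gpu in query_gpus():
        # Expect something like: NVIDIA GeForce RTX 4090, ~23 GB free, compute_cap 8.9
        print(gpu)
```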

Stage 2: Resource allocation
  • Log evidence
    time=2025-03-09T08:04:36.422+08:00 level=INFO source=server.go:130 msg=offload ... memory.required.allocations="[21.5 GiB]"
  • Details
    • VRAM allocation:
      • Total requirement is estimated at 21.5 GiB: roughly 19.5 GiB of weights plus a 2 GiB KV cache and compute buffers (a quick arithmetic check of the KV figure follows this list).
      • A ~418 MiB model buffer stays in CPU memory for the part of the model that is not offloaded.
    • Memory strategy:
      • Offload to the GPU first; anything that does not fit stays in system memory.
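The 2 GiB KV-cache figure can be re-derived from values that appear later in the log (kv_size=8192, n_layer=64, n_embd_k_gqa = n_embd_v_gqa = 1024, f16 cache). A minimal arithmetic check:

```python
# Back-of-the-envelope check of the KV cache size reported in the log.
# All inputs come straight from the llama_kv_cache_init / print_meta lines.
kv_size       = 8192   # total cache length (n_ctx; 4 sequences x 2048)
n_layer       = 64     # qwen2.block_count
n_embd_k_gqa  = 1024   # per-layer K width after grouped-query attention
n_embd_v_gqa  = 1024   # per-layer V width
bytes_per_f16 = 2      # type_k = type_v = 'f16'

kv_bytes = kv_size * n_layer * (n_embd_k_gqa + n_embd_v_gqa) * bytes_per_f16
print(f"KV cache: {kv_bytes / 2**20:.2f} MiB")  # -> 2048.00 MiB, matching the log
```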

Stage 3: Model loading
  • Log evidence
    llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors ... (version GGUF V3)
    llm_load_tensors: offloaded 65/65 layers to GPU
  • Details
    • Metadata: model architecture (Qwen2), tokenizer configuration (BPE), training context length (131k).
    • Tensor loading:
      • 321 F32 tensors (small tensors such as norms and biases kept at full precision).
      • 385 Q4_K quantized tensors (high compression).
      • 65 Q6_K quantized tensors (a precision/size compromise).
    • KV cache: 2 GiB of VRAM reserved for attention key/value states.
    • The tensor and metadata counts come straight from the GGUF header; a minimal header-reading sketch follows.
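The "33 key-value pairs and 771 tensors" line is read from the fixed-size GGUF header. Below is a minimal sketch of such a header reader; the layout in the comments is the documented GGUF v2/v3 header, and the path is the blob from this log.

```python
# Minimal GGUF header reader: magic, version, tensor count, metadata KV count.
# Header layout (little-endian): 4-byte magic "GGUF", uint32 version,
# uint64 tensor_count, uint64 metadata_kv_count.
import struct

def read_gguf_header(path):
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file: magic={magic!r}")
        version, tensor_count, kv_count = struct.unpack("<IQQ", f.read(4 + 8 + 8))
    return {"version": version, "tensors": tensor_count, "metadata_kv": kv_count}

if __name__ == "__main__":
    blob = r"D:\Software\AI\ollama\model\blobs\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb"
    # For this blob the loader reports: GGUF V3, 771 tensors, 33 KV pairs.
    print(read_gguf_header(blob))
```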

Stage 4: Service startup
  • Log evidence
    time=2025-03-09T08:04:36.424+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="..."
    time=2025-03-09T08:04:43.186+08:00 level=INFO source=server.go:596 msg="llama runner started in 6.76 seconds"
  • Details
    • Server configuration:
      • Total context of 8192 tokens, shared across 4 parallel slots (n_ctx_per_seq = 2048 per conversation).
      • Batch size of 512 tokens (throughput optimization).
    • Acceleration:
      • CUDA graphs enabled (USE_GRAPHS = 1, reducing kernel-launch overhead).
      • Flash Attention not enabled (flash_attn = 0; Ollama leaves it off unless OLLAMA_FLASH_ATTENTION is set).
    • Startup time: about 6.76 seconds from spawning the runner to it reporting ready.
    • Once ready, requests go through Ollama's public HTTP API; a minimal request sketch follows.
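Note that port 52036 belongs to the internal llama runner; clients talk to Ollama's public HTTP API, which listens on port 11434 by default. A minimal request sketch, assuming a default local install with qwq:32b already pulled:

```python
# Sketch: send one non-streaming generation request to a local Ollama server.
# Assumes Ollama is listening on its default port 11434; the port seen in the
# log (52036) is the internal llama-runner port, not the public API.
import json
import urllib.request

payload = {"model": "qwq:32b", "prompt": "Say hello in one sentence.", "stream": False}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.loads(resp.read())

print(body["response"])                  # the generated text
print(body.get("eval_count"), "tokens")  # basic throughput info returned by the API
```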

  1. Hardware detection: check whether the GPU and memory are big enough.
  2. Resource allocation: put most of the model on the GPU and keep the rest in system memory.
  3. Model loading: read the model file, set up the tensors and tokenizer, and allocate VRAM.
  4. Service startup: bring up a local server, apply the parameters, and wait for user prompts.

The whole process is like building with LEGO: check the pieces (hardware), plan the build (resource allocation), assemble it (model loading), and finally plug it in (service startup).


When you type ollama run qwq:32b in a terminal, it is like pressing a launch button. The system checks your machine (is the GPU big enough? is there enough memory?), splits the model file into "bricks" the GPU can handle, and assembles them into an engine that can answer questions. The process covers the same four core stages: hardware detection, resource allocation, model loading, and service startup.

Ollama startup flow diagram

The system first scans for CUDA devices and finds an NVIDIA RTX 4090 with about 23 GB of free VRAM [[10]], then verifies that its compute capability (8.9) supports the required kernels. This step decides whether GPU acceleration can be used; otherwise Ollama falls back to CPU mode.

Via system calls it obtains:

  • Physical memory: 31.8 GiB total (18.7 GiB free)
  • Swap: 22.0 GiB free
  • GPU VRAM: ~22.3 GiB available, against a 21.5 GiB requirement. These figures drive the loading strategy, for example whether layers have to be split between GPU and CPU [[3]].
The scheduler then carves memory into the following buffers (summed in the sketch after the table):

| Resource | Allocation | Purpose |
| --- | --- | --- |
| Main VRAM buffer (CUDA0 model buffer) | 18508 MiB | Model weights |
| KV cache (CUDA0 KV buffer) | 2048 MiB | Attention key/value states |
| CUDA compute buffer | 696 MiB | Scratch space for the compute graph |
| CPU model buffer | 418 MiB | Model data kept in system memory |
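Summing the GPU-side buffers from the log roughly recovers the scheduler's 21.5 GiB estimate; the estimate comes out slightly above the raw sum because it also reserves headroom for CUDA overhead.

```python
# Rough cross-check of the scheduler's "required = 21.5 GiB" estimate against
# the buffers actually reported on the GPU side of the log (values in MiB).
vram_buffers_mib = {
    "model weights (CUDA0 model buffer)":   18508.35,
    "KV cache (CUDA0 KV buffer)":            2048.00,
    "compute graph (CUDA0 compute buffer)":   696.00,
}
total_mib = sum(vram_buffers_mib.values())
print(f"measured VRAM use ~= {total_mib / 1024:.2f} GiB")  # ~20.75 GiB
# The 21.5 GiB figure is an upfront estimate that also reserves headroom for
# CUDA overhead, which is why it sits a little above this raw sum.
```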

The loader uses a layered offload strategy and places all 65 layers on the GPU:

  • 64 repeating layers (transformer blocks) take up the bulk of the VRAM
  • 1 non-repeating output layer is offloaded as well
  • Quantized tensors (Q4_K/Q6_K) keep the overall memory footprint down [[5]]

From the sha256-c62ccde563 model blob the loader extracts:

  • 33 metadata key-value pairs (model architecture, tokenizer rules, and so on)
  • 771 tensors (321 F32 + 385 Q4_K + 65 Q6_K)
  • a BPE tokenizer with a 152k-token vocabulary [[7]]
graph TD
    A[GGUF file] --> B[Parse metadata]
    A --> C[Load tensors]
    B --> D{Validate model parameters}
    C --> E[Map tensors into VRAM]
    D -->|validation passes| E

Key parameters (summarized from the runner command and context log):

{
  "n_ctx": 8192,        // total context length (shared by 4 parallel slots)
  "n_batch": 512,       // batch size
  "n_gpu_layers": 65,   // layers offloaded to the GPU
  "flash_attn": false   // Flash Attention not enabled
}
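Users normally do not edit these runner flags directly; they are derived from Ollama options such as num_ctx, num_batch, num_gpu and num_thread. The mapping below is an assumption based on this log (for instance, the --ctx-size of 8192 is consistent with the default num_ctx of 2048 times the 4 parallel slots) and may differ between Ollama versions.

```python
# Sketch: assumed mapping from the runner flags above to user-facing options,
# passed via the "options" field of the same /api/generate call as earlier.
#   --batch-size   -> num_batch
#   --n-gpu-layers -> num_gpu
#   --threads      -> num_thread
#   --ctx-size     -> num_ctx x parallel slots (2048 x 4 = 8192 in this log)
payload = {
    "model": "qwq:32b",
    "prompt": "ping",
    "stream": False,
    "options": {"num_ctx": 2048, "num_batch": 512, "num_gpu": 65, "num_thread": 8},
}
```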
  • Open the runner's HTTP port (52036 in this log); the public Ollama API stays on its default port
  • Serve the API endpoints: /api/show for model information, /api/generate and /api/chat for inference
  • Enable CUDA graph optimization (2 graph splits) to improve throughput [[8]]
  • Poll the runner until it reports ready ("waiting for server to become available" → "llama runner started")
  • Pre-allocate the KV cache
  • Reserve the estimated compute-graph memory (the 676 MiB memory.graph.full figure in the offload log) [[9]]
  1. Insufficient VRAM: when the model does not fit on a single card, reduce the number of offloaded layers (the --n-gpu-layers value in the runner command, exposed to users as the num_gpu option); a rough sizing sketch follows this list.
  2. Performance tuning: the batch size trades throughput against compute-buffer VRAM; larger batches speed up prompt processing but need more memory.
  3. Version compatibility: GGUF V3 files need Ollama 0.2.0 or later.
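For tip 1, a very rough way to estimate how many layers fit on a smaller card is to divide the repeating-weight and KV-cache totals from the offload log by the layer count. The 12 GiB figure below is a hypothetical example, and Ollama's scheduler also reserves compute-buffer and CUDA overhead, so its actual choice will be somewhat lower than this estimate.

```python
# Back-of-the-envelope partial-offload estimate, using figures from the offload
# log line: repeating weights 18.9 GiB over 64 layers, plus a proportional
# share of the 2 GiB KV cache. Purely illustrative arithmetic.
GIB = 1024**3
repeating_weights = 18.9 * GIB
kv_cache          = 2.0 * GIB
n_layers          = 64
per_layer = (repeating_weights + kv_cache) / n_layers    # ~334 MiB per layer

free_vram    = 12 * GIB          # hypothetical smaller GPU
nonrepeating = 609.1 * 1024**2   # output/embedding weights from the log
layers_that_fit = int((free_vram - nonrepeating) / per_layer)
print(f"~{per_layer / 2**20:.0f} MiB per layer; num_gpu ~ {layers_that_fit} on a 12 GiB card")
```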

With the ollama run command, developers can manage large models much like Docker containers, collapsing what used to be a distributed-computing exercise into a single-machine operation. That layer of abstraction is what brings deploying a 130B-parameter model down from a PhD-level task to something a beginner can do [[6]].