What exactly does ollama run do at startup?

The server log below captures one such startup (ollama run qwq:32b on a Windows machine with an RTX 4090):

 [GIN] 2025/03/09 - 08:04:36 | 200 |            0s |  192.168.11.100 | HEAD     "/"
[GIN] 2025/03/09 - 08:04:36 | 200 |     12.6348ms |  192.168.11.100 | POST     "/api/show"
time=2025-03-09T08:04:36.409+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-09T08:04:36.409+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-09T08:04:36.409+08:00 level=INFO source=sched.go:715 msg="new model will fit in available VRAM in single GPU, loading" model=D:\Software\AI\ollama\model\blobs\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb gpu=GPU-213bb6a8-2adc-9591-a814-ae650a18a4f2 parallel=4 available=23918694400 required="21.5 GiB"
time=2025-03-09T08:04:36.421+08:00 level=INFO source=server.go:97 msg="system memory" total="31.8 GiB" free="18.7 GiB" free_swap="22.0 GiB"
time=2025-03-09T08:04:36.421+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.key_length default=128
time=2025-03-09T08:04:36.421+08:00 level=WARN source=ggml.go:132 msg="key not found" key=qwen2.attention.value_length default=128
time=2025-03-09T08:04:36.422+08:00 level=INFO source=server.go:130 msg=offload library=cuda layers.requested=-1 layers.model=65 layers.offload=65 layers.split="" memory.available="[22.3 GiB]" memory.gpu_overhead="0 B" memory.required.full="21.5 GiB" memory.required.partial="21.5 GiB" memory.required.kv="2.0 GiB" memory.required.allocations="[21.5 GiB]" memory.weights.total="19.5 GiB" memory.weights.repeating="18.9 GiB" memory.weights.nonrepeating="609.1 MiB" memory.graph.full="676.0 MiB" memory.graph.partial="916.1 MiB"
time=2025-03-09T08:04:36.424+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="D:\\Software\\AI\\ollama\\ollama.exe runner --model D:\\Software\\AI\\ollama\\model\\blobs\\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb --ctx-size 8192 --batch-size 512 --n-gpu-layers 65 --threads 8 --no-mmap --parallel 4 --port 52036"
time=2025-03-09T08:04:36.426+08:00 level=INFO source=sched.go:450 msg="loaded runners" count=1
time=2025-03-09T08:04:36.426+08:00 level=INFO source=server.go:557 msg="waiting for llama runner to start responding"
time=2025-03-09T08:04:36.427+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server error"
time=2025-03-09T08:04:36.443+08:00 level=INFO source=runner.go:932 msg="starting go runner"
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
load_backend: loaded CUDA backend from D:\Software\AI\ollama\lib\ollama\cuda_v12\ggml-cuda.dll
load_backend: loaded CPU backend from D:\Software\AI\ollama\lib\ollama\ggml-cpu-alderlake.dll
time=2025-03-09T08:04:36.596+08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 600,610,620,700,720,750,800,860,870,890,900 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | cgo(clang)" threads=8
time=2025-03-09T08:04:36.596+08:00 level=INFO source=runner.go:993 msg="Server listening on 127.0.0.1:52036"
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23008 MiB free
time=2025-03-09T08:04:36.677+08:00 level=INFO source=server.go:591 msg="waiting for server to become available" status="llm server loading model"
llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors from D:\Software\AI\ollama\model\blobs\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = QwQ 32B
llama_model_loader: - kv   3:                           general.basename str              = QwQ
llama_model_loader: - kv   4:                         general.size_label str              = 32B
llama_model_loader: - kv   5:                            general.license str              = apache-2.0
llama_model_loader: - kv   6:                       general.license.link str              = https://huggingface.co/Qwen/QWQ-32B/b...
llama_model_loader: - kv   7:                   general.base_model.count u32              = 1
llama_model_loader: - kv   8:                  general.base_model.0.name str              = Qwen2.5 32B
llama_model_loader: - kv   9:          general.base_model.0.organization str              = Qwen
llama_model_loader: - kv  10:              general.base_model.0.repo_url str              = https://huggingface.co/Qwen/Qwen2.5-32B
llama_model_loader: - kv  11:                               general.tags arr[str,2]       = ["chat", "text-generation"]
llama_model_loader: - kv  12:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  13:                          qwen2.block_count u32              = 64
llama_model_loader: - kv  14:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv  15:                     qwen2.embedding_length u32              = 5120
llama_model_loader: - kv  16:                  qwen2.feed_forward_length u32              = 27648
llama_model_loader: - kv  17:                 qwen2.attention.head_count u32              = 40
llama_model_loader: - kv  18:              qwen2.attention.head_count_kv u32              = 8
llama_model_loader: - kv  19:                       qwen2.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  20:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  27:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  28:                tokenizer.ggml.bos_token_id u32              = 151643
llama_model_loader: - kv  29:               tokenizer.ggml.add_bos_token bool             = false
llama_model_loader: - kv  30:                    tokenizer.chat_template str              = {%- if tools %}\n    {{- '<|im_start|>...
llama_model_loader: - kv  31:               general.quantization_version u32              = 2
llama_model_loader: - kv  32:                          general.file_type u32              = 15
llama_model_loader: - type  f32:  321 tensors
llama_model_loader: - type q4_K:  385 tensors
llama_model_loader: - type q6_K:   65 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = qwen2
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 152064
llm_load_print_meta: n_merges         = 151387
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_layer          = 64
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 5
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 27648
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 32B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 32.76 B
llm_load_print_meta: model size       = 18.48 GiB (4.85 BPW)
llm_load_print_meta: general.name     = QwQ 32B
llm_load_print_meta: BOS token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOS token        = 151645 '<|im_end|>'
llm_load_print_meta: EOT token        = 151645 '<|im_end|>'
llm_load_print_meta: PAD token        = 151643 '<|endoftext|>'
llm_load_print_meta: LF token         = 148848 'ÄĬ'
llm_load_print_meta: FIM PRE token    = 151659 '<|fim_prefix|>'
llm_load_print_meta: FIM SUF token    = 151661 '<|fim_suffix|>'
llm_load_print_meta: FIM MID token    = 151660 '<|fim_middle|>'
llm_load_print_meta: FIM PAD token    = 151662 '<|fim_pad|>'
llm_load_print_meta: FIM REP token    = 151663 '<|repo_name|>'
llm_load_print_meta: FIM SEP token    = 151664 '<|file_sep|>'
llm_load_print_meta: EOG token        = 151643 '<|endoftext|>'
llm_load_print_meta: EOG token        = 151645 '<|im_end|>'
llm_load_print_meta: EOG token        = 151662 '<|fim_pad|>'
llm_load_print_meta: EOG token        = 151663 '<|repo_name|>'
llm_load_print_meta: EOG token        = 151664 '<|file_sep|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: offloading 64 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 65/65 layers to GPU
llm_load_tensors:          CPU model buffer size =   417.66 MiB
llm_load_tensors:        CUDA0 model buffer size = 18508.35 MiB
llama_new_context_with_model: n_seq_max     = 4
llama_new_context_with_model: n_ctx         = 8192
llama_new_context_with_model: n_ctx_per_seq = 2048
llama_new_context_with_model: n_batch       = 2048
llama_new_context_with_model: n_ubatch      = 512
llama_new_context_with_model: flash_attn    = 0
llama_new_context_with_model: freq_base     = 1000000.0
llama_new_context_with_model: freq_scale    = 1
llama_new_context_with_model: n_ctx_per_seq (2048) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 64, can_shift = 1
llama_kv_cache_init:      CUDA0 KV buffer size =  2048.00 MiB
llama_new_context_with_model: KV self size  = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     2.40 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   696.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    26.01 MiB
llama_new_context_with_model: graph nodes  = 2246
llama_new_context_with_model: graph splits = 2
time=2025-03-09T08:04:43.186+08:00 level=INFO source=server.go:596 msg="llama runner started in 6.76 seconds"

Based on this log, here is the Ollama startup flow, broken down into four stages: hardware detection → resource allocation → model loading → service startup:


[User runs the command]
[1. Hardware detection]
   ├── Detect the GPU (RTX 4090) and its VRAM (~23 GiB)
   ├── Check CPU threads and system memory (31.8 GiB total, 18.7 GiB free)
   └── Check the CUDA environment (compute capability 8.9)
[2. Resource allocation]
   ├── VRAM allocation plan:
   │    ├── Total model requirement: 21.5 GiB
   │    ├── Offload all 65 layers to the GPU (64 transformer blocks plus the output layer)
   │    └── Keep a ~418 MiB model buffer in CPU memory
   └── System memory check:
        ├── RAM: 18.7 GiB free
        └── Swap: 22.0 GiB free
[3. Model loading]
   ├── Locate the model blob by its SHA-256 digest
   ├── Load the GGUF file:
   │    ├── Metadata (33 key-value pairs)
   │    ├── Tensors (771 in total, mixing Q4_K and Q6_K quantization)
   │    └── Tokenizer configuration (BPE, ~152k-entry vocabulary)
   ├── Offload layers to the GPU:
   │    ├── 64 repeating layers (transformer blocks)
   │    └── 1 non-repeating output layer
   └── Initialize the KV cache (2 GiB of VRAM)
[4. Service startup]
   ├── Launch the llama runner process:
   │    ├── Bind a local port (127.0.0.1:52036)
   │    ├── Set context parameters (ctx-size 8192, batch-size 512)
   │    └── Enable CUDA acceleration (flash attention stays off in this run)
   ├── Readiness check:
   │    ├── Build the compute graph (graph splits = 2, CUDA graphs enabled)
   │    └── Wait for the runner's health endpoint to respond
   └── Serve API requests (/api/show, /api/generate, ...)
[Startup complete — waiting for user input]

Stage 1: Hardware detection

  • Log evidence

    ggml_cuda_init: found 1 CUDA devices: Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
    time=2025-03-09T08:04:36.421+08:00 level=INFO source=server.go:97 msg="system memory" total="31.8 GiB" free="18.7 GiB"

  • Details
    • An RTX 4090 with roughly 23 GiB of free VRAM is detected, along with 31.8 GiB of system memory (18.7 GiB free).
    • The CUDA environment is confirmed: compute capability 8.9, i.e. the Ada Lovelace architecture of the RTX 40 series. The sketch below shows one way to reproduce this check from outside Ollama.
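
As a cross-check of this stage, the following sketch queries the same GPU name and VRAM figures through nvidia-smi. This is not how Ollama performs the detection (it loads the CUDA/NVML libraries directly); it is only an easy way to confirm the numbers that show up in the log:

import subprocess

# Query GPU name and memory through nvidia-smi as an external cross-check.
result = subprocess.run(
    ["nvidia-smi", "--query-gpu=name,memory.total,memory.free",
     "--format=csv,noheader"],
    capture_output=True, text=True, check=True,
)
for line in result.stdout.strip().splitlines():
    # e.g. "NVIDIA GeForce RTX 4090, 24564 MiB, 23008 MiB"
    print(line)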

Stage 2: Resource allocation

  • Log evidence

    time=2025-03-09T08:04:36.422+08:00 level=INFO source=server.go:130 msg=offload ... memory.required.allocations="[21.5 GiB]"

  • Details
    • VRAM budget:
      • The total requirement is 21.5 GiB: roughly 19.5 GiB of weights, 2 GiB of KV cache, plus a few hundred MiB of compute buffers (the worked example below reconstructs the KV-cache figure).
      • A ~418 MiB model buffer stays in CPU memory for the tensors that are not offloaded.
    • Placement policy:
      • Everything that fits is offloaded to the GPU; anything left over stays in system RAM.
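
The 2 GiB KV-cache figure can be reconstructed from metadata that appears in the log (64 layers, 8 KV heads of dimension 128, f16 cache, 8192-token context). A minimal check:

# Reconstruct "KV self size = 2048.00 MiB" from the numbers in the log.
n_layer    = 64          # qwen2.block_count
ctx        = 8192        # --ctx-size (shared by the 4 parallel slots)
n_embd_gqa = 8 * 128     # n_head_kv * n_embd_head_k = 1024
bytes_f16  = 2           # type_k = type_v = 'f16'

kv_bytes = 2 * n_layer * ctx * n_embd_gqa * bytes_f16   # K and V
print(kv_bytes / 2**20, "MiB")                          # -> 2048.0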

Stage 3: Model loading

  • Log evidence

    llama_model_loader: loaded meta data with 33 key-value pairs and 771 tensors ... (version GGUF V3)
    llm_load_tensors: offloaded 65/65 layers to GPU

  • Details
    • Metadata: model architecture (qwen2), tokenizer configuration (BPE), training context length (131k tokens).
    • Tensor breakdown (a quick sanity check of the reported 4.85 bits per weight follows below):
      • 321 F32 tensors (mostly layer norms and attention biases, kept at full precision).
      • 385 Q4_K tensors (the bulk of the weights, heavily compressed).
      • 65 Q6_K tensors (a higher-precision quantization used for selected weights).
    • KV cache: 2 GiB of VRAM is reserved for attention keys and values.
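
A quick arithmetic check of the "18.48 GiB (4.85 BPW)" line from the log: bits per weight is just total model bytes times 8, divided by the parameter count.

size_bytes = 18.48 * 2**30   # model size from llm_load_print_meta
params     = 32.76e9         # model params from llm_load_print_meta

print(round(size_bytes * 8 / params, 2), "bits per weight")   # -> 4.85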

Stage 4: Service startup

  • Log evidence

    time=2025-03-09T08:04:36.424+08:00 level=INFO source=server.go:380 msg="starting llama server" cmd="..."
    time=2025-03-09T08:04:43.186+08:00 level=INFO source=server.go:596 msg="llama runner started in 6.76 seconds"

  • Details
    • Runner configuration:
      • A total context of 8192 tokens, shared by 4 parallel sequences (2048 tokens per sequence).
      • A batch size of 512 tokens for prompt-processing throughput.
    • Acceleration:
      • CUDA graphs are enabled (USE_GRAPHS = 1), reducing kernel-launch overhead.
      • Flash attention is not enabled in this run (flash_attn = 0); it is opt-in via the OLLAMA_FLASH_ATTENTION environment variable.
    • Startup time: about 6.76 seconds from spawning the runner to it reporting ready. Once it is ready, the loaded model can be inspected as sketched below.
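
A minimal sketch, assuming the default API port 11434, that asks the running server which models are loaded and how much of each sits in VRAM (the same information as ollama ps):

import json, urllib.request

# List the models Ollama currently has loaded.
with urllib.request.urlopen("http://localhost:11434/api/ps") as resp:
    data = json.load(resp)

for m in data.get("models", []):
    vram_gib = m.get("size_vram", 0) / 2**30
    print(m.get("name"), f"- {vram_gib:.1f} GiB in VRAM")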

  1. Hardware detection: check whether the GPU and memory are up to the job.
  2. Resource allocation: put as much of the model as possible on the GPU and keep the small remainder in RAM.
  3. Model loading: read the model file, set up the weights and the tokenizer, and map everything into VRAM.
  4. Service startup: start a local server, apply the runtime parameters, and wait for the user's prompt.

The whole process is a bit like building with LEGO: check the parts (hardware), plan the build (resource allocation), assemble it (model loading), and finally plug in the power (service startup).


When you type ollama run qwq:32b in a terminal, it is like pressing a launch button. The system first checks your machine (is the GPU big enough? is there enough RAM?), then splits the model file into "bricks" the GPU can handle, and finally assembles them into an engine that can answer questions. The process covers four core stages: hardware detection, resource allocation, model loading, and service startup.

(Figure: Ollama startup flowchart)

The system first scans for CUDA devices and finds an NVIDIA RTX 4090 (about 23 GiB of usable VRAM) [[10]], then checks its compute capability (8.9) to confirm the card can run the model. This step decides whether GPU acceleration is used at all; without a usable GPU, Ollama falls back to CPU inference.

Through system calls it then collects:

  • Physical memory: 31.8 GiB total, 18.7 GiB free
  • Swap space: 22.0 GiB free
  • GPU VRAM: ~23 GiB total, ~22.3 GiB available — these numbers drive the loading strategy, for example whether every layer can be offloaded or only some of them [[3]].
Resource               Allocation   Purpose
VRAM model buffer      ~18.5 GiB    Model weights
KV cache               2 GiB        Attention keys/values for the active context
CUDA compute buffer    696 MiB      Temporary activations during inference
CPU model buffer       ~418 MiB     Weights kept in system RAM (not offloaded)
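
Summing the actual CUDA0 allocations reported in the log shows how the table above relates to the scheduler's 21.5 GiB estimate (the estimate leaves some headroom):

model_buffer   = 18508.35   # MiB, "CUDA0 model buffer size"
kv_buffer      = 2048.00    # MiB, "CUDA0 KV buffer size"
compute_buffer = 696.00     # MiB, "CUDA0 compute buffer size"

total = model_buffer + kv_buffer + compute_buffer
print(f"{total:.0f} MiB = {total / 1024:.1f} GiB")   # ~21252 MiB, about 20.8 GiB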

The loader uses a layer-by-layer offload strategy and, in this case, places all 65 layers on the GPU:

  • 64 repeating layers (the transformer blocks) take up the bulk of the VRAM
  • 1 non-repeating output layer is also placed on the GPU
  • Q4_K/Q6_K quantization keeps the weight footprint down to about 18.5 GiB [[5]]

From the model blob sha256-c62ccde563… the loader extracts:

  • 33 metadata key-value pairs (model architecture, tokenizer rules, chat template, and so on)
  • 771 tensors (321 F32 + 385 Q4_K + 65 Q6_K)
  • A BPE tokenizer with a ~152k-entry vocabulary [[7]]
GGUF file → metadata parsing → tensor loading → parameter validation → VRAM mapping (each step proceeding once the previous check passes)
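
The "33 key-value pairs and 771 tensors" figures come straight from the fixed-size GGUF header. A minimal reader (covering only the v2/v3 header layout, not the metadata records that follow it):

import struct

def read_gguf_header(path):
    # GGUF v2/v3 header: 4-byte magic, uint32 version,
    # uint64 tensor count, uint64 metadata key-value count.
    with open(path, "rb") as f:
        if f.read(4) != b"GGUF":
            raise ValueError("not a GGUF file")
        (version,)      = struct.unpack("<I", f.read(4))
        (tensor_count,) = struct.unpack("<Q", f.read(8))
        (kv_count,)     = struct.unpack("<Q", f.read(8))
    return version, tensor_count, kv_count

# For the blob in this log this should print (3, 771, 33).
print(read_gguf_header(r"D:\Software\AI\ollama\model\blobs\sha256-c62ccde5630c20c8a9cf601861d31977d07450cad6dfdf1c661aab307107bddb"))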

Key runner parameters:

{
  "n_ctx": 8192,        // total context length (shared by 4 parallel slots)
  "n_batch": 512,       // batch size for prompt processing
  "n_gpu_layers": 65,   // layers offloaded to the GPU
  "flash_attn": false   // flash attention not enabled in this run
}
  • The runner opens a local HTTP port (52036 in this log), while the main server keeps listening on the default API port (11434)
  • The Ollama API endpoints remain available on the main server: /api/show returns model metadata, while /api/generate and /api/chat handle inference (see the request sketch below)
  • The compute graph is built with only 2 splits and CUDA graphs are enabled, which cuts kernel-launch overhead [[8]]
  • The parent process polls the runner until it reports healthy ("llama runner started in 6.76 seconds")
  • The 2 GiB KV cache is pre-allocated up front rather than grown on demand
  • A ~696 MiB CUDA compute buffer is allocated for intermediate activations [[9]]
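
Once the runner reports ready, inference goes through the regular Ollama API. A minimal, non-streaming request sketch (the model tag qwq:32b is assumed, matching the ollama run qwq:32b example above):

import json, urllib.request

payload = {
    "model": "qwq:32b",            # tag assumed from the example command above
    "prompt": "Why is the sky blue?",
    "stream": False,               # one JSON object instead of a token stream
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body["response"])
print("generated tokens:", body.get("eval_count"))
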
A few practical notes:

  1. Not enough VRAM: if the model does not fit on a single card, lower the number of offloaded layers (the num_gpu option, i.e. --n-gpu-layers) so the remaining layers run from system RAM; a rough way to estimate how many layers fit is sketched below.
  2. Performance tuning: the batch size is measured in tokens, not bytes; a larger batch speeds up prompt processing at the cost of a larger compute buffer.
  3. Version compatibility: make sure your Ollama build is recent enough to read GGUF V3 files such as this one.
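
For note 1, here is a rough back-of-the-envelope heuristic derived only from this log's numbers (Ollama's actual scheduler also budgets for graph buffers and per-GPU overhead, so treat this as an estimate, not the real algorithm):

GIB = 1024**3
repeating_weights = 18.9 * GIB   # memory.weights.repeating from the log
kv_cache_total    = 2.0 * GIB    # KV cache for the full 8192-token context
n_layers          = 64

per_layer = (repeating_weights + kv_cache_total) / n_layers   # ~334 MiB per layer

def layers_that_fit(free_vram, reserve=1.5 * GIB):
    # Reserve room for the output layer, compute buffer and driver overhead.
    usable = max(0.0, free_vram - reserve)
    return min(n_layers, int(usable // per_layer))

print(layers_that_fit(12 * GIB))   # e.g. a 12 GiB card -> roughly 32 layers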

With the ollama run command, developers can operate large models much the way they operate Docker containers, collapsing what used to be a distributed-computing exercise into a single-machine command. That layer of abstraction turns deploying a model with tens of billions of parameters from an expert-level project into something a beginner can do [[6]].
