Explanations
large language model, trained by
Tokens that indicate the speaker's identity as a large language model: words such as "large", "language", and "model", and related self-identifying phrases or questions.
Negative Logits
locks              0.47
resembled          0.47
resulted           0.43
becomes            0.43
[token not shown]  0.43
τού                0.41
ទ្រ                0.41
[token not shown]  0.41
[token not shown]  0.41
becomes            0.40
Positive Logits
работаю      0.67
uyorum       0.64
знаю         0.57
हूं           0.57
jestem       0.55
ıyorum       0.54
atualmente   0.54
这意味着      0.54
我现在        0.53
আছি          0.51
Activations Density 0.133%