INDEX
Explanations
large language model
references to the model's identity or the phrase "As a large language model" (self‑referential model introductions).
The neuron activates on self-referential disclaimer phrases in the style of "As a large language model".
Negative Logits
convirt    0.42
sehingga    0.40
wodurch    0.40
بنابراین    0.37
。    0.37
に示す    0.37
випад    0.37
تركيب    0.36
ngunit    0.36
ପ୍ର    0.35
Positive Logits
indexRouter    0.43
étant    0.42
我都    0.40
being    0.39
YouTuber    0.39
having    0.39
having    0.38
itself    0.38
我会    0.38
lover    0.37
Activations Density 0.069%