Explanations
AI language model
This neuron detects the model’s self-description phrase “As a large language model” (and similar self-referential disclaimers).
Negative Logits
BK  0.43
creciente  0.43
Б  0.40
приветствую  0.39
B  0.38
расту  0.38
aumentada  0.38
blight  0.37
समावेश  0.37
esp  0.37
Positive Logits
doesn  0.45
ستانی  0.41
doesn  0.40
项  0.38
紓  0.37
വുമായി  0.37
item  0.36
stanu  0.36
정  0.36
eikä  0.36
Activations Density 0.017%
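The page does not say how the density figure is computed. A minimal sketch, assuming it is the fraction of sampled token positions on which the neuron's activation is above zero (so 0.017% would correspond to roughly 17 firing positions per 100,000 tokens). The function name `activation_density`, the `threshold` parameter, and the sample data are hypothetical and used only for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of positions where the neuron fires.
    This is an illustrative definition, not one confirmed by the page.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    fired = sum(1 for a in activations if a > threshold)
    return fired / len(activations)


# Hypothetical data: 10,000 token positions, 2 of which fire.
sample = [0.0] * 9998 + [1.3, 0.7]
density = activation_density(sample)
print(f"{density:.3%}")
```

A density this low is consistent with a neuron that responds to a narrow pattern, such as a specific self-referential phrase, rather than to general text.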