Explanations
question answering
This neuron activates on self-introductory disclaimers, particularly the phrase “As an AI language model.”
Negative Logits
iVar        -0.07
agree       -0.07
cade        -0.07
TRE         -0.07
question    -0.07
Gast        -0.07
协议        -0.06
ощи         -0.06
าณ          -0.06
observes    -0.06
Positive Logits
-к            0.07
[new          0.06
celý          0.06
Camping       0.06
Measurement   0.06
gây           0.06
YYSTYPE       0.06
*-            0.06
μό            0.06
hoje          0.06
Activations Density: 0.042%