INDEX
Explanations
emoticons and affirmative phrases
The neuron detects tokens that are part of the model's direct factual answer or highlighted content—especially proper nouns, numbers, and emphasized/answer text.
Negative Logits
Token            Value
teorie           0.48
carbone          0.44
troviamo         0.42
tanha            0.42
transgress       0.41
théorie          0.40
cynicism         0.40
rebellion        0.40
exacerbate       0.39
transgression    0.39
Positive Logits
Token            Value
:)               0.59
:)               0.53
:-)              0.52
😊               0.51
it               0.49
;)               0.48
or               0.48
不过             0.47
៕                0.47
🙂               0.47
Activations Density: 0.654%