Explanations
The neuron activates selectively on the preposition “against.”
Negative Logits
290     -0.08
660     -0.07
110     -0.07
395     -0.07
do      -0.07
988     -0.07
kode    -0.07
lul     -0.07
ho      -0.06
099     -0.06
Positive Logits
against     0.17
against     0.15
Against     0.14
Against     0.13
против      0.08
UNG         0.08
:"; ↵       0.08
приг        0.07
apanese     0.07
zens        0.07
Activations Density: 0.028%
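
To put the density figure in concrete terms, here is a minimal sketch. It assumes density means the fraction of sampled tokens on which the neuron's activation exceeds a threshold (zero here); the page itself does not state the definition. The function names and the toy activation list are hypothetical, chosen only to reproduce the 0.028% figure (28 active tokens out of 100,000).

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Fraction of tokens whose activation is strictly above `threshold`.

    Assumption: "density" is the share of sampled tokens on which the
    neuron fires at all. The dashboard does not define the term.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


def format_density(density: float) -> str:
    """Render a density in [0, 1] as a percentage with three decimals."""
    return f"{density * 100:.3f}%"


if __name__ == "__main__":
    # Toy data: 28 firing tokens among 100,000 sampled tokens.
    toy_activations = [0.0] * 99_972 + [1.5] * 28
    print(format_density(activation_density(toy_activations)))  # 0.028%
```

Under this reading, a density of 0.028% means the neuron fires on roughly 1 in every 3,600 tokens, which fits a neuron tied to a single word.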