INDEX
Explanations
The neuron's activations spike on the adjective “new,” indicating that it detects uses of that word.
Negative Logits
Token       Logit
sofas       -0.08
امر         -0.07
(face       -0.07
vrd         -0.07
against     -0.07
basket      -0.07
lerinden    -0.07
against     -0.07
troubled    -0.07
Sitting     -0.07
Positive Logits
Token        Logit
nev          0.06
závod        0.06
bev          0.06
LOUR         0.06
Shib         0.06
Northwest    0.05
_UINT        0.05
_currency    0.05
Claudia      0.05
conditioned  0.05
Activations Density: 0.026%