Explanations
The neuron fires on phrases expressing disapproval or that something “is not okay.”
Negative Logits
Token        Logit
Saying       -0.07
اير          -0.07
area         -0.06
equilibrium  -0.06
calcul       -0.06
-specific    -0.06
there        -0.06
获           -0.06
showed       -0.06
triggered    -0.06
Positive Logits
Token        Logit
phê          0.07
(Vertex      0.07
this         0.07
(';          0.07
(urls        0.06
DERP         0.06
WX           0.06
음악         0.06
.setView     0.06
_Bl          0.06
Activations Density 0.058%
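
The density figure presumably reports the share of sampled token positions on which the neuron has a nonzero activation. The page does not define it, so treat that reading as an assumption. A minimal sketch under that assumption, where the function name `activation_density` and the sample values are hypothetical and not taken from the dashboard:

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of token positions whose activation exceeds threshold.

    Assumption: "density" means the fraction of positions with a nonzero
    activation. The dashboard itself does not state its definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: 10,000 token positions, 6 of them active.
sample = [0.0] * 9994 + [1.2, 0.4, 3.1, 0.9, 2.2, 0.1]
density = activation_density(sample)
print(f"{density:.3f}%")
```

Under this reading, a density of 0.058% would mean the neuron is active on roughly 6 of every 10,000 token positions.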