Explanations
multiple languages
The neuron fires on specialized technical terminology—particularly NLP/linguistics jargon—rather than ordinary words.
Negative Logits
ิหาร            -0.06
SWT             -0.06
Pixar           -0.06
load            -0.06
rif             -0.06
폴              -0.06
pulls           -0.06
爽              -0.06
_Il             -0.06
ICIENT          -0.06
Positive Logits
Successful       0.07
compulsory       0.06
(elem            0.06
ograd            0.06
(errorMessage    0.06
usuarios         0.06
endorsement      0.06
(panel           0.06
cznie            0.06
fraudulent       0.06
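Lists like the two above are commonly produced by direct logit attribution: project the feature's output direction through the model's unembedding matrix and read off the most negative and most positive tokens. The page does not say how its values were computed or whether they were normalized, so the following is only a minimal sketch under that assumption. `top_logit_effects`, `direction`, and `unembed` are hypothetical names, and plain lists stand in for model weights.

```python
from __future__ import annotations

from typing import Sequence


def top_logit_effects(
    direction: Sequence[float],
    unembed: Sequence[Sequence[float]],
    vocab: Sequence[str],
    k: int = 10,
) -> tuple[list[tuple[str, float]], list[tuple[str, float]]]:
    """Return the k most negative and k most positive direct logit effects.

    Assumption (not stated on the dashboard): the effect on token t is the
    dot product of the feature's output direction with column t of the
    unembedding matrix, with no normalization applied.

    direction: output direction in the residual stream, length d_model.
    unembed:   d_model rows, each of length len(vocab).
    """
    if len(unembed) != len(direction):
        raise ValueError("unembed must have one row per residual dimension")
    if any(len(row) != len(vocab) for row in unembed):
        raise ValueError("every unembed row must have one entry per token")

    # effect[t] = sum_i direction[i] * unembed[i][t]
    effects = [
        sum(direction[i] * unembed[i][t] for i in range(len(direction)))
        for t in range(len(vocab))
    ]

    # Sort ascending by effect; the head is most negative, the tail most positive.
    ranked = sorted(zip(vocab, effects), key=lambda pair: pair[1])
    negative = ranked[:k]
    positive = ranked[::-1][:k]
    return negative, positive
```

With a toy two-dimensional direction `[1.0, -0.5]` and a three-token vocabulary, the function returns the single token whose unembedding column aligns least with the direction as the top negative entry and the best-aligned one as the top positive entry, mirroring the shape of the dashboard lists.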
Activation Density 0.123%
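Activation density is usually read as the fraction of tokens on which the feature is active, so 0.123% corresponds to roughly 1.23 active tokens per thousand. The page does not state the activation threshold it uses, so the sketch below assumes "active" means strictly greater than zero. `activation_density` is a hypothetical helper, not a dashboard API.

```python
from typing import Iterable


def activation_density(activations: Iterable[float], threshold: float = 0.0) -> float:
    """Fraction of tokens whose activation exceeds `threshold`.

    Assumption: a token counts as active when its activation is strictly
    above the threshold (default 0.0); the dashboard's exact rule is not given.
    """
    total = 0
    active = 0
    for value in activations:
        total += 1
        if value > threshold:
            active += 1
    if total == 0:
        raise ValueError("no activations supplied")
    return active / total
```

For example, 123 active tokens among 100,000 gives a density of 0.00123, which formats with `f"{density:.3%}"` as the percentage shown above.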