Explanations
The neuron activates on words that signal problems, faults, or negative evaluations (e.g. malfunctions, errors, disadvantages).
Negative Logits
_dropout      -0.07
_some         -0.07
refs          -0.07
closeButton   -0.06
Occ           -0.06
cultivation   -0.06
Kom           -0.06
startswith    -0.06
_vis          -0.06
graduate      -0.06
Positive Logits
              0.07
"=>"          0.06
。<            0.06
inton         0.06
erties        0.06
massac        0.06
ORTH          0.06
',['          0.06
Sr            0.06
Pregnancy     0.06
Activations Density 0.079%