Explanations
This neuron responds to words and phrases describing negative consequences, risks, or harmful outcomes.
Negative Logits
das           -0.07
virtual       -0.06
--------------------------------------------------------------------------------  -0.06
Your          -0.06
prevention    -0.06
者の          -0.06
your          -0.06
скую          -0.06
developers    -0.06
هوری          -0.06
Positive Logits
had           0.07
REFERRED      0.07
have          0.07
DST           0.06
Perf          0.06
Courier       0.06
directed      0.06
Mitt          0.06
Has           0.06
held          0.06
Activations Density: 0.072%