Explanations
The neuron activates on words expressing human dignity, autonomy, and related respect-and-rights concepts.
Negative Logits
Token         Logit
.must         -0.07
lookout       -0.06
related       -0.06
-running      -0.06
์ว            -0.06
federation    -0.06
leftovers     -0.06
otros         -0.06
secretive     -0.06
clich         -0.06
Positive Logits
Token         Logit
dignity       0.13
dign          0.09
_UPPER        0.07
indign        0.07
Agricult      0.07
Signature     0.06
↵ ↵↵          0.06
ุณ            0.06
:!            0.06
.di           0.06
Activations Density: 0.003%