INDEX
Explanations
personal harm in the context of various scenarios
references to personal health and privacy concerns
Negative Logits
zag         -0.67
hex         -0.64
iard        -0.64
eus         -0.63
owered      -0.63
oned        -0.62
gob         -0.62
optional    -0.62
opus        -0.62
ominated    -0.61
Positive Logits
livelihood    1.45
credibility   1.43
integrity     1.32
reputation    1.32
ability       1.30
sanity        1.23
freedoms      1.19
psyche        1.18
liberty       1.18
morals        1.16
Activations Density: 0.265%