Explanations
actions related to supporting, helping, or harming others
actions related to helping, harming, and the ethical implications of those actions
Negative Logits
iHUD                  -0.77
:]                    -0.77
.}                    -0.76
thereof               -0.73
hers                  -0.70
guiActiveUnfocused    -0.69
................................................................  -0.67
ruff                  -0.65
.","                  -0.64
ItemThumbnailImage    -0.63
Positive Logits
unsuspecting          1.07
strangers             0.99
hordes                0.96
peoples               0.94
passers               0.94
politicians           0.93
opponents             0.92
enemies               0.88
clients               0.88
celebrities           0.86
Activations Density 0.654%