INDEX
Explanations
terms related to inappropriate, harmful, or unethical actions
terms related to abusive and unethical behavior
Negative Logits
electric          -0.79
pop               -0.74
rition            -0.68
ellar             -0.67
soType            -0.66
oret              -0.63
arro              -0.61
ICA               -0.61
grown             -0.60
soDeliveryDate    -0.60
Positive Logits
perpetrated       1.01
inflicted         0.90
incurred          0.90
towards           0.86
misconduct        0.83
committed         0.83
crimes            0.83
dealings          0.81
whatsoever        0.81
harming           0.80
Activations Density 0.174%