INDEX
Explanations
phrases related to legal actions and consequences
references to legal and ethical issues
Negative Logits
uart        -0.71
leans       -0.67
ridor       -0.61
ortment     -0.60
visual      -0.59
cean        -0.59
knit        -0.58
prepar      -0.58
ript        -0.58
resid       -0.57
Positive Logits
unjust      0.83
bullies     0.82
injustice   0.74
tresp       0.73
cowardly    0.73
slander     0.71
abusive     0.70
unfair      0.69
harassment  0.69
merciless   0.68
Activations Density: 0.790%