INDEX
Explanations
negative attributes and behaviors associated with characters, particularly cruelty and abrasiveness
Negative Logits
ovice       -0.17
eteria      -0.17
EMPL        -0.16
Bedford     -0.15
ekten       -0.14
laughter    -0.14
830         -0.14
khung       -0.14
Zwe         -0.14
owi         -0.13
Positive Logits
mean        0.33
abrasive    0.33
comb        0.32
ob          0.30
Mean        0.29
mean        0.29
rude        0.28
alo         0.28
confront    0.27
entitled    0.27
Activations Density: 0.478%