Explanations
words related to negative judgments about a person's character or behavior
derogatory terms or insults directed towards individuals
Negative Logits
Token       Logit
undai       -0.95
ells        -0.82
earchers    -0.80
elve        -0.80
fman        -0.80
anmar       -0.76
usable      -0.76
idays       -0.75
ña          -0.75
jong        -0.75
Positive Logits
Token       Logit
idiot       0.95
thief       0.79
extraord    0.75
idiots      0.72
beware      0.72
Investor    0.71
hypoc       0.71
liar        0.71
kid         0.69
loser       0.68
Activations Density: 0.023%
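
The page does not say how the density figure is computed. A minimal sketch follows, assuming it is the fraction of sampled tokens on which the feature's activation is above zero. The function name `activation_density`, the `threshold` parameter, and the synthetic activation list are all hypothetical and chosen only for illustration. The counts are made up so that the result lands on the same order of magnitude as the figure above, and they are not the data behind this feature.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" means the share of tokens on which the feature
    fires. The source page does not state its exact definition.
    """
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic illustration (not real data): 23 firing tokens out of 100,000.
acts = [0.0] * 100_000
for i in range(23):
    acts[i * 1000] = 1.5  # arbitrary positive activation value

density = activation_density(acts)
print(f"{density:.3%}")
```

Under this reading, a density this low means the feature fires on only a few tokens per ten thousand, which fits a feature tied to a narrow set of insult-like words.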