INDEX
Explanations
words associated with deception or negative consequences
Negative Logits
Token    Logit
ahlen    -0.15
aydı     -0.15
argon    -0.15
urch     -0.14
айд      -0.14
umann    -0.14
agma     -0.14
ÎijÎł     -0.14
ìm       -0.14
rale     -0.14
Positive Logits
Token    Logit
ous      0.82
ously    0.68
OUS      0.62
ious     0.56
uous     0.54
ouse     0.50
oust     0.50
uos      0.48
ousand   0.48
IOUS     0.47
Activations Density 0.063%
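
As a rough illustration of how a density figure like the one above can be read, the sketch below assumes that density means the fraction of token positions on which the feature has a nonzero activation. That definition, the function name `activation_density`, the `threshold` parameter, and the sample data are all assumptions made for illustration; none of them come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of token positions whose activation exceeds threshold.

    Assumed definition for illustration only; the dashboard may compute
    its density figure differently.
    """
    if not activations:
        raise ValueError("need at least one activation value")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 10,000 token positions, 6 of them with a nonzero activation.
sample = [0.0] * 9994 + [1.2, 0.4, 3.1, 0.9, 2.2, 0.7]
density = activation_density(sample)
print(f"{density:.3%}")  # formatted as a percentage with three decimals
```

Under this reading, a density of 0.063% would correspond to the feature firing on roughly 6 of every 10,000 tokens.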