INDEX
Explanations
phrases related to negative actions or behaviors
Negative Logits
edia          -0.78
_>            -0.76
itan          -0.74
akeru         -0.72
ocobo         -0.71
aeda          -0.71
udeau         -0.69
Downloadha    -0.65
Roosevelt     -0.65
btn           -0.65
Positive Logits
cery          1.00
smelling      0.89
sie           0.84
terness       0.82
foul          0.78
mouth         0.76
s             0.75
nesses        0.73
eners         0.70
rance         0.70
Activations Density 0.018%