INDEX
Explanations
phrases indicating political hypocrisy and the avoidance of difficult discussions
Negative Logits
subpo      -0.16
irie       -0.13
emony      -0.13
.dtd       -0.13
woord      -0.13
stÃŃ       -0.13
ůž         -0.13
iminal     -0.13
imeo       -0.13
_MAY       -0.13
Positive Logits
discussion      0.96
discussions     0.81
discussion      0.77
conversation    0.75
discuss         0.75
Discussion      0.74
debate          0.73
discussing      0.69
Discussion      0.68
discussed       0.65
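Dashboards of this kind typically derive the positive and negative logit lists by projecting the feature's decoder direction through the model's unembedding matrix and taking the highest and lowest scoring tokens. The sketch below assumes that approach; the names decoder_direction, W_U, vocab, and top_bottom_logits are hypothetical and not taken from this page, and the toy data stands in for real model weights.

```python
import numpy as np

def top_bottom_logits(decoder_direction, W_U, vocab, k=10):
    """Return the k most-promoted and k most-suppressed tokens for a feature.

    Assumption: decoder_direction has shape (d_model,) and W_U (the
    unembedding matrix) has shape (d_model, vocab_size).
    """
    logits = decoder_direction @ W_U          # one score per vocabulary token
    order = np.argsort(logits)                # ascending: most negative first
    bottom = [(vocab[i], float(logits[i])) for i in order[:k]]
    top = [(vocab[i], float(logits[i])) for i in order[::-1][:k]]
    return top, bottom

# Toy example: identity unembedding, so each token's logit equals one coordinate.
vocab = ["a", "b", "c", "d"]
W_U = np.eye(4)
direction = np.array([0.5, -0.2, 0.9, 0.0])
top, bottom = top_bottom_logits(direction, W_U, vocab, k=2)
```

Here top ranks "c" then "a", and bottom ranks "b" then "d".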
Activation Density: 0.789%
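Activation density is usually read as the percentage of token positions in the evaluation data on which the feature fires at all. A minimal sketch, assuming "fires" means an activation strictly above zero (the threshold parameter and the activation_density name are illustrative assumptions):

```python
import numpy as np

def activation_density(activations, threshold=0.0):
    """Percentage of token positions where the feature's activation exceeds threshold."""
    activations = np.asarray(activations, dtype=float)
    return 100.0 * np.count_nonzero(activations > threshold) / activations.size

# Toy example: the feature is active on 2 of 8 token positions.
density = activation_density([0, 0, 1.2, 0, 0.3, 0, 0, 0])
```

On this toy input the density is 25.0%; a value such as 0.789% indicates a sparse feature that is active on fewer than 1 in 100 tokens.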