INDEX
Explanations
expressions of contradictions or unexpected outcomes in societal conditions
Negative Logits
ombat       -0.17
acomment    -0.15
udson       -0.14
olkien      -0.14
ptal        -0.14
ince        -0.14
ureau       -0.14
utton       -0.14
imbus       -0.14
edicine     -0.14
Positive Logits
arak     0.18
noch     0.17
STILL    0.15
still    0.15
Still    0.15
still    0.15
Ox       0.15
plx      0.14
Still    0.14
cast     0.14
Activations Density 0.286%