INDEX
Explanations
themes related to hypocrisy and contradictions in beliefs versus actions
Negative Logits
YLES            -0.17
usted           -0.16
ovie            -0.15
seedu           -0.15
irit            -0.15
pla             -0.15
fond            -0.14
strip           -0.14
isque           -0.14
exampleModal    -0.14
Positive Logits
something       0.20
something       0.20
Something       0.17
Something       0.17
(thing          0.15
něco            0.15
excellence      0.15
omething        0.15
.Iter           0.14
şeyi            0.14
Activations Density 0.229%