INDEX
Explanations
references to hypocrisy and double standards in behavior or beliefs
Negative Logits
cannot      -0.20
Cannot      -0.19
cannot      -0.17
Cannot      -0.16
أيضا        -0.15
orig        -0.15
is          -0.14
Dont        -0.14
am          -0.14
イント      -0.14
Positive Logits
're     0.47
've     0.42
'll     0.42
’re     0.40
'd      0.37
'm      0.36
’ll     0.35
’ve     0.35
’d      0.30
’m      0.29
Activations Density 1.011%