Explanations
declarations of error or incorrectness in rationale or beliefs
Negative Logits
apest       -0.72
concess     -0.71
egu         -0.67
estern      -0.59
yss         -0.58
Mub         -0.58
psey        -0.56
ileged      -0.56
orkshire    -0.54
earchers    -0.54
Positive Logits
;)          0.75
imaginable  0.73
attRot      0.69
oneself     0.67
*.          0.66
?).         0.66
haha        0.65
itself      0.64
existed     0.64
herself     0.63
Activations Density: 0.381%