INDEX
Explanations
elements and instances of self-reflection and existential questioning
Negative Logits
brid          -0.81
increasingly  -0.78
raft          -0.78
pressing      -0.78
revers        -0.77
favour        -0.76
continuous    -0.75
favor         -0.75
coral         -0.74
cycl          -0.73
Positive Logits
And             1.51
Advertisements  1.42
It              1.40
They            1.40
Instead         1.38
Because         1.38
That            1.37
However         1.36
Anyone          1.36
Until           1.35
Activations Density 0.401%