Explanations
phrases containing the words "on the surface."
phrases indicating surface-level observations or appearances
Negative Logits
ļéĨĴ        -0.81
arov        -0.78
ãĥ¥         -0.78
eed         -0.77
leground    -0.73
TN          -0.70
inar        -0.69
ufact       -0.69
staking     -0.68
arah        -0.68
Positive Logits
however         1.08
though          0.82
somew           0.76
there           0.69
psychologists   0.68
moreover        0.66
please          0.66
suffice         0.65
whoever         0.63
there           0.63
Activations Density: 0.274%