INDEX
Explanations
instances of reference and commentary on perceptions or beliefs
Negative Logits
Token    Logit
oton     -0.16
olio     -0.16
itta     -0.15
ofi      -0.15
inte     -0.15
erb      -0.15
uide     -0.15
ozo      -0.14
itas     -0.14
wers     -0.14
Positive Logits
Token    Logit
instead  0.38
merely   0.36
Instead  0.33
Instead  0.32
nor      0.31
nor      0.30
instead  0.30
simply   0.27
Nor      0.27
Nor      0.27
Activations Density: 0.247%