Explanations
phrases indicating reciprocal relationships or opposition
Negative Logits
Token      Logit
essim      -0.15
ati        -0.15
loon       -0.15
_locals    -0.14
esel       -0.14
owell      -0.14
atu        -0.14
av         -0.14
uen        -0.14
uide       -0.14
Positive Logits
Token      Logit
convers    0.19
vice       0.18
igne       0.17
VICE       0.16
Vice       0.15
vero       0.15
ajs        0.15
versa      0.15
etat       0.15
vice       0.14
Activations Density: 0.010%
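
For orientation, here is a minimal sketch of how a density figure like this one might be computed. It assumes that "activations density" means the fraction of sampled tokens on which the feature has a non-zero activation; the page does not define the term, so that reading is an assumption. The function name `activation_density`, the `threshold` parameter, and the sample data are all hypothetical and chosen only to illustrate the arithmetic.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of tokens whose activation exceeds `threshold`.

    Assumption: "density" is the share of tokens on which the feature
    fires at all. The source page does not state its exact definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: the feature fires on 1 token out of 10,000.
acts = [0.0] * 9_999 + [3.2]
density = activation_density(acts)
print(f"{density:.3%}")  # prints "0.010%"
```

Under this reading, a density of 0.010% corresponds to roughly one activating token per ten thousand, which would fit a feature tied to a narrow phrase such as "vice versa".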