INDEX
Explanations
phrases indicating future intentions or desires
Negative Logits
ories      -0.15
hiro       -0.15
resi       -0.14
oulouse    -0.14
fork       -0.14
esar       -0.14
ufe        -0.14
Hayes      -0.14
omain      -0.13
ilen       -0.13
Positive Logits
onda       0.15
ittel      0.15
oir        0.14
beh        0.14
itable     0.14
omu        0.14
OnClick    0.14
steder     0.14
loc        0.14
achi       0.14
Activations Density: 0.012%