INDEX
Explanations
phrases emphasizing consequences or relationships between actions and outcomes
Negative Logits
egas      -0.17
ventus    -0.16
iverz     -0.15
indr      -0.15
ocks      -0.15
cko       -0.15
oose      -0.14
cf        -0.14
orno      -0.14
hardt     -0.14
Positive Logits
ises        0.15
.sy         0.15
Genius      0.15
rzy         0.14
oll         0.14
.Zip        0.13
_CTL        0.13
FileUtils   0.13
moi         0.13
tú          0.13
Activations Density: 0.091%