INDEX
Explanations
questions that inquire about actions or definitions
Negative Logits
anguage     -0.19
arella      -0.17
stanov      -0.17
ibold       -0.17
aginator    -0.17
borg        -0.16
neau        -0.15
miyor       -0.14
polator     -0.14
nero        -0.14
Positive Logits
ower            0.15
eldorf          0.14
/do             0.14
λÏĮ             0.14
/cat            0.14
ê»ĺ             0.14
els             0.14
eno             0.13
lectric         0.13
satisfaction    0.13
Activations Density 0.049%