INDEX
Explanations
responses that express advice or solutions to questions
Negative Logits
stown       -0.17
akt         -0.16
SCRIPTOR    -0.15
boro        -0.14
fo          -0.14
bah         -0.14
iece        -0.14
ares        -0.14
åı·         -0.13
itness      -0.13
Positive Logits
:↵↵         0.18
licos       0.14
flesh       0.14
COPE        0.14
loff        0.14
elabor      0.14
dera        0.13
ebi         0.13
+↵↵         0.13
Benchmark   0.13
Activations Density: 0.006%