INDEX
Explanations
phrases indicating summarization or commentary
Negative Logits
remember      -0.16
remember      -0.15
sov           -0.15
rvine         -0.15
edin          -0.14
icina         -0.14
iasi          -0.14
Ħĸ            -0.14
elden         -0.14
ãĥ¬ãĥĥãĥĪ     -0.14
Positive Logits
describe      0.19
explain       0.18
correct       0.17
tell          0.17
Tell          0.17
descri        0.17
Tell          0.17
Describe      0.16
Describe      0.16
Explain       0.15
Activations Density 0.072%