Explanations
phrases that express providing information or insights
NEGATIVE LOGITS
/Instruction  -0.17
iras          -0.15
iven          -0.15
unami         -0.15
entai         -0.15
ÎIJ           -0.14
aska          -0.14
uentes        -0.14
ahoma         -0.14
phinx         -0.14
POSITIVE LOGITS
idea          0.60
Idea          0.48
idea          0.46
sense         0.39
notion        0.35
indication    0.35
feel          0.33
IDEA          0.30
handle        0.30
sense         0.29
Activation density: 0.137%
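The page lists the feature's strongest negative and positive logit tokens and its activation density, but it does not say how these numbers are produced. The following is a minimal sketch that assumes a common setup: the logit effect of each token is the feature's decoder direction multiplied by the model's unembedding matrix, and density is the percentage of tokens on which the activation exceeds a threshold (0 here). Every name in it (`logit_effects`, `top_bottom_k`, `activation_density`, the `w_u[i][j]` layout) is hypothetical and not taken from the dashboard.

```python
from typing import List, Sequence, Tuple


def logit_effects(decoder_dir: Sequence[float],
                  w_u: Sequence[Sequence[float]]) -> List[float]:
    """Project a (hypothetical) feature decoder direction through the unembedding.

    Assumed layout: w_u[i][j] is the weight from residual dimension i to vocab token j.
    """
    vocab_size = len(w_u[0])
    return [
        sum(decoder_dir[i] * w_u[i][j] for i in range(len(decoder_dir)))
        for j in range(vocab_size)
    ]


def top_bottom_k(effects: Sequence[float], tokens: Sequence[str],
                 k: int) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and k most positive (token, effect) pairs."""
    ranked = sorted(zip(tokens, effects), key=lambda pair: pair[1])
    negative = ranked[:k]          # most negative first
    positive = ranked[::-1][:k]    # most positive first
    return negative, positive


def activation_density(acts: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens whose activation exceeds the (assumed) threshold."""
    if not acts:
        return 0.0
    return 100.0 * sum(1 for a in acts if a > threshold) / len(acts)


# Toy usage with made-up numbers (not the feature's real weights).
tokens = ["idea", "sense", "iras", "aska"]
decoder = [1.0, 0.5]
w_u = [[0.4, 0.2, -0.1, 0.0],
       [0.4, 0.3, -0.1, -0.2]]
neg, pos = top_bottom_k(logit_effects(decoder, w_u), tokens, k=2)
print("negative:", neg)
print("positive:", pos)

# 137 active tokens out of 100,000 gives the 0.137% shown above.
print(activation_density([1.0] * 137 + [0.0] * 99_863))
```

Sorting once and then slicing from both ends is enough for a toy vocabulary. For a real vocabulary of tens of thousands of tokens, a partial selection such as `heapq.nsmallest` / `heapq.nlargest` avoids a full sort.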