Explanations
expressions of surprise
Negative Logits
Token       Logit
another     -0.35
these       -0.33
Yeo         -0.33
קו          -0.32
eo          -0.32
a           -0.31
cleanup     -0.31
enforced    -0.30
toege       -0.29
ことなく    -0.29
Positive Logits
Token       Logit
surprised   1.80
surprised   1.73
Surprised   1.41
shocked     1.23
shocked     1.20
surpris     1.20
astonished  1.14
überrascht  1.09
amazed      1.09
sorprend    1.09
Activations Density 0.004%
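
The density figure above can be read as the share of token positions on which this feature fires at all. The panel does not say how the number is computed, so the following is only a minimal sketch under that assumption: `activation_density`, the zero threshold, and the sample data are all hypothetical and chosen for illustration.

```python
def activation_density(activations, threshold=0.0):
    """Return the fraction of positions whose activation exceeds `threshold`.

    Assumption: "density" means the share of token positions where the
    feature is active. The dashboard does not confirm this definition.
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Hypothetical data: 100,000 token positions, 4 of which activate the feature.
sample = [0.0] * 100_000
for i in (10, 2_500, 40_000, 99_999):
    sample[i] = 1.5

density = activation_density(sample)
print(f"{density:.3%}")
```

With these made-up numbers the feature is active on 4 of 100,000 positions, which formats as 0.004%, the same order of sparsity reported for this feature.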