INDEX
Explanations
the word "surprise" at various levels of activation
expressions of surprise
Negative Logits
nan       -0.83
folios    -0.81
©¶æ       -0.81
oreal     -0.81
odynam    -0.80
agra      -0.80
tein      -0.78
İĭ        -0.76
asus      -0.74
uel       -0.73
Positive Logits
surprise    0.81
absor       0.79
surprises   0.79
guests      0.79
Surprise    0.76
ingly       0.76
Flavoring   0.75
visitor     0.74
Squid       0.74
ãģį         0.71
Activations Density 0.047%