INDEX
Explanations
instances of the word "think" in various forms
Negative Logits
ez       -0.17
fred     -0.15
姿       -0.15
alent    -0.15
iser     -0.15
asar     -0.14
sher     -0.14
aná      -0.14
)((((    -0.14
/by      -0.14
Positive Logits
about    0.24
twice    0.23
Twice    0.22
象       0.18
tank     0.18
_about   0.18
.about   0.17
-about   0.17
About    0.16
cape     0.16
Activations Density 0.084%