Explanations
references to the concept of "fool" or foolishness
Negative Logits
Token      Logit
ello       -0.18
shire      -0.16
Miracle    -0.16
etta       -0.16
zing       -0.15
vt         -0.15
çĽĸ        -0.15
oga        -0.14
wick       -0.14
SURE       -0.14
Positive Logits
Token      Logit
hard       0.29
proof      0.28
ishly      0.25
ery        0.22
-proof     0.20
osoph      0.19
oose       0.18
ardy       0.18
ERY        0.18
fo         0.17
Activations Density: 0.012%