INDEX
Explanations
the word "so" used in various contexts
Negative Logits
Token     Logit
roman     -0.20
so        -0.19
ting      -0.19
work      -0.17
phant     -0.17
ature     -0.16
b         -0.15
uckle     -0.15
rt        -0.15
un        -0.15
Positive Logits
Token     Logit
-called   0.40
ooo       0.26
oooo      0.25
ething    0.24
apy       0.23
iled      0.23
oth       0.22
oner      0.21
oooooooo  0.21
aping     0.19
Activations Density: 0.038%