Explanations
the word "so" with varying degrees of emphasis
Negative Logits
theless     -0.72
works       -0.61
work        -0.58
expectancy  -0.56
Purg        -0.55
amac        -0.52
Flavoring   -0.52
Halls       -0.52
eviction    -0.51
Mens        -0.51
Positive Logits
bered       1.31
oths        1.26
apy         1.21
othes       1.18
oooo        1.11
othe        1.11
ooo         1.09
oner        1.05
far         0.96
aps         0.94
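The logit lists above pair each token with a score for how strongly the feature pushes that token up or down. Dashboards of this kind commonly get these scores by projecting the feature's decoder direction through the model's unembedding matrix and keeping the most negative and most positive entries. This page does not state its method, so the sketch below assumes that approach. The names `logit_effects`, `decoder_dir`, and `unembed` are hypothetical, and the toy numbers in the usage note are invented for illustration.

```python
from typing import List, Sequence, Tuple


def logit_effects(
    decoder_dir: Sequence[float],
    unembed: Sequence[Sequence[float]],
    vocab: Sequence[str],
    k: int = 10,
) -> Tuple[List[Tuple[str, float]], List[Tuple[str, float]]]:
    """Return the k most negative and k most positive (token, score) pairs.

    Hypothetical helper (assumed method, not confirmed by the source):
    decoder_dir is the feature's decoder vector of length d_model, and
    unembed is the unembedding matrix laid out as d_model rows by
    len(vocab) columns.
    """
    d_model = len(decoder_dir)
    # score[t] = sum_i decoder_dir[i] * unembed[i][t], i.e. decoder_dir @ W_U
    scores = [
        sum(decoder_dir[i] * unembed[i][t] for i in range(d_model))
        for t in range(len(vocab))
    ]
    # Sort ascending by score: the front holds the most negative effects.
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1])
    negative = ranked[:k]        # most negative first
    positive = ranked[::-1][:k]  # most positive first
    return negative, positive
```

With a toy four-token vocabulary `["so", "far", "works", "apy"]`, `decoder_dir = [1.0, -0.5]`, and `unembed = [[0.2, 0.9, -0.6, 1.2], [0.4, -0.1, 0.2, 0.0]]`, calling `logit_effects(..., k=2)` ranks `works` and `so` as the negative side and `apy` and `far` as the positive side.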
Activation Density: 0.255%
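Activation density is usually read as the share of tokens in a sample on which the feature is active. Under that reading, 0.255% means roughly 255 firing positions per 100,000 tokens. The minimal sketch below assumes that "active" means an activation strictly above a threshold of 0.0. The helper name `activation_density` and the threshold default are hypothetical, not taken from this page.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Percentage of tokens on which the feature fires.

    Hypothetical helper: a token counts as "firing" when its activation is
    strictly greater than `threshold` (0.0 is an assumed default).
    """
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)
```

For example, 100,000 activations with 255 of them positive give a density of 0.255 (percent).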