INDEX
Explanations
phrases indicating distance or extent
NEGATIVE LOGITS
874        -0.18
nete       -0.17
ffer       -0.16
chw        -0.16
dings      -0.15
obar       -0.15
broader    -0.14
work       -0.14
IDD        -0.14
eil        -0.14
POSITIVE LOGITS
-reaching  0.28
thest      0.23
/fast      0.22
away       0.22
away       0.21
mland      0.20
into       0.20
into       0.19
Away       0.19
apart      0.18
Activations Density: 0.033%