INDEX
Explanations
the presence of the word "at" and related variations in various contexts
Negative Logits
es      -0.27
ing     -0.26
ed      -0.23
hole    -0.20
ho      -0.20
halt    -0.20
hb      -0.20
eri     -0.20
hoff    -0.19
hill    -0.19
Positive Logits
ting    0.27
tempts  0.24
tempt   0.22
ernal   0.21
lı      0.20
tement  0.20
URNS    0.19
ollah   0.18
aylor   0.18
sume    0.18
Activations Density 0.099%