INDEX
Explanations
repeated usages of the word "We"
Negative Logits
lights    -0.17
unda      -0.15
oub       -0.15
fu        -0.15
rim       -0.15
bí        -0.15
arra      -0.15
ctor      -0.15
dez       -0.15
double    -0.15
Positive Logits
avers     0.25
imar      0.24
bsp       0.23
athers    0.23
itere     0.22
ave       0.21
asley     0.21
brtc      0.21
aving     0.21
aver      0.21
Activations Density 0.031%