Explanations
instances of the word "on."
Negative Logits
Token         Logit
/by           -0.18
/from         -0.18
cul           -0.17
duct          -0.16
олÑĮно        -0.16
ducted        -0.15
clusions      -0.15
ì§ľ           -0.15
gos           -0.14
clarations    -0.14
Positive Logits
Token         Logit
balance       0.28
average       0.27
ward          0.27
closer        0.26
paper         0.26
top           0.26
reflection    0.23
average       0.22
rare          0.22
thing         0.20
Activations Density 0.069%