INDEX
Explanations
references to contrasting ideas or alternatives
Negative Logits
evi       -0.15
Į¨        -0.15
Ïį        -0.14
lez       -0.14
INGLE     -0.14
adel      -0.14
yla       -0.14
.ZERO     -0.13
ÑĢÑĥÑĪ    -0.13
sel       -0.13
Positive Logits
side          0.41
half          0.34
end           0.33
extreme       0.32
party         0.31
direction     0.31
half          0.30
most          0.29
hemisphere    0.29
-half         0.29
Activations Density 0.088%