INDEX
Explanations
questions and references to individuals involved in decision-making or accountability
Negative Logits
only      -0.19
only      -0.18
лишь      -0.17
ONLY      -0.16
Only      -0.15
fst       -0.15
ेवल       -0.15
hanya     -0.15
Only      -0.14
nowhere   -0.14
Positive Logits
ultimately   0.26
owns         0.22
controls     0.21
these        0.20
Ultimately   0.20
they         0.20
those        0.19
actually     0.19
vlastně      0.19
ultimate     0.19
Activations Density 0.154%