INDEX
Explanations
instances of intentional and deliberate actions or consequences
Negative Logits
/bit        -0.17
å¯Ħ         -0.15
rios        -0.15
alama       -0.15
ãģ¨ãĤĤ      -0.15
Tham        -0.14
Sensitive   -0.14
lä          -0.14
ensitive    -0.14
overall     -0.14
Positive Logits
SED          0.19
ubar         0.17
fully        0.16
ously        0.16
gart         0.16
aidu         0.16
ably         0.15
atively      0.14
intentional  0.14
iously       0.14
Activations Density: 0.051%