Explanations
phrases indicating potential or capability
Negative Logits
Token        Logit
ccak         -0.15
Reasons      -0.15
utom         -0.14
usi          -0.14
³            -0.14
ãĥģãĥ¥       -0.14
hints        -0.14
ÏģÏī         -0.14
OTHERWISE    -0.13
SKIP         -0.13
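Tokens such as ãĥģãĥ¥ and ÏģÏī in the negative-logit list look like the printable stand-ins that byte-level BPE vocabularies use for raw bytes. The sketch below assumes (the page does not say so) that the model uses a GPT-2 style byte-to-unicode table, and maps such a token string back to UTF-8 text. The helper names `byte_decoder` and `decode_token` are hypothetical and not part of any library.

```python
def byte_decoder():
    """Build the inverse of a GPT-2 style byte-to-unicode table (assumed vocabulary style)."""
    # Bytes that are shown as themselves in the vocabulary.
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    table = {b: chr(b) for b in printable}
    # Every remaining byte is shifted to code point 256 and up, in ascending byte order.
    shift = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + shift)
            shift += 1
    return {ch: b for b, ch in table.items()}


def decode_token(token):
    """Turn a vocabulary string back into text; invalid UTF-8 becomes U+FFFD."""
    decoder = byte_decoder()
    raw = bytes(decoder[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


print(decode_token("ÏģÏī"))    # ρω
print(decode_token("ãĥģãĥ¥"))  # チュ
```

If the assumption holds, these two entries are a Greek and a Japanese fragment rather than encoding damage, so they are left in the table exactly as the page shows them.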
Positive Logits
Token        Logit
added        0.27
mak          0.26
distinction  0.25
potential    0.24
capability   0.24
tendency     0.23
advantage    0.23
capacity     0.22
same         0.22
distinct     0.21
Activations Density: 0.059%
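As a rough illustration of what a density figure of this size means, the sketch below assumes that activation density is the fraction of sampled tokens on which the feature's activation is above zero; the page itself does not define the term. The data is synthetic and chosen only so the arithmetic matches the reported 0.059%, and the function name `activation_density` is hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Fraction of tokens whose activation exceeds the threshold (assumed definition)."""
    if not activations:
        return 0.0
    active = sum(1 for a in activations if a > threshold)
    return active / len(activations)


# Synthetic example, not data from the dashboard:
# the feature fires on 59 out of 100,000 tokens.
sample = [1.0] * 59 + [0.0] * (100_000 - 59)
density = activation_density(sample)
print(f"{density:.3%}")  # 0.059%
```

Under that assumed definition, the feature is active on roughly 6 tokens in every 10,000.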