INDEX
Explanations
phrases indicating specific types or categories of things
Negative Logits
types     -0.20
kinds     -0.19
Types     -0.18
elsen     -0.17
Types     -0.16
_types    -0.16
-types    -0.15
sorts     -0.15
uhan      -0.15
types     -0.14
Positive Logits
thing      0.24
thing      0.23
behaviour  0.15
事情       0.15
warfare    0.15
behavior   0.15
데         0.15
thinking   0.15
activity   0.15
IDI        0.15
Activations Density 0.075%