INDEX
Explanations
evaluative adjectives indicating quality or preference
Negative Logits
552     -0.16
dia     -0.14
pii     -0.14
赤      -0.14
beit    -0.14
idis    -0.14
ANTA    -0.14
Pot     -0.14
Ậ       -0.14
ONGL    -0.13
Positive Logits
way         0.31
idea        0.29
Idea        0.27
choice      0.26
thing       0.25
bet         0.24
option      0.23
strategy    0.22
Way         0.22
strategy    0.21
Activations Density: 0.061%