INDEX
Explanations
references to personality traits and self-reflection
Negative Logits
zusammen     -0.14
ويل          -0.14
Primitive    -0.13
جل           -0.13
unrelated    -0.13
illion       -0.13
illon        -0.13
olas         -0.12
чих          -0.12
uster        -0.12
Positive Logits
amb          0.41
ambiguity    0.34
mixed        0.33
ambiguous    0.33
oscill       0.32
undecided    0.32
split        0.31
neither      0.31
gray         0.31
mixed        0.30
Activations Density 0.488%