INDEX
Explanations
phrases that reflect personal introspection and self-reflection
Negative Logits
ez          -0.16
rava        -0.16
imo         -0.15
fern        -0.15
bargain     -0.14
asers       -0.14
.mit        -0.14
indsight    -0.14
äge         -0.13
aná         -0.13
Positive Logits
ways            0.26
Ways            0.18
象              0.17
how             0.17
differently     0.16
ramifications   0.15
ering           0.15
owitz           0.14
ulus            0.14
worst           0.14
Activations Density: 0.068%