INDEX
Explanations
recurring themes and patterns in experiences and responses
Negative Logits
unlike      -0.18
zar         -0.16
enaire      -0.16
orre        -0.15
Unlike      -0.15
optionally  -0.15
erate       -0.15
Unlike      -0.15
hin         -0.14
atel        -0.14
Positive Logits
same        0.62
same        0.58
相同        0.54
Same        0.52
Same        0.52
identical   0.52
SAME        0.47
_same       0.45
similar     0.44
同          0.44
Activations Density 0.056%