INDEX
Explanations
phrases indicating observation or perception
Negative Logits
iga      -0.15
isi      -0.15
ier      -0.15
ï¸ı      -0.14
é        -0.14
unt      -0.14
igers    -0.14
uss      -0.14
iner     -0.13
asc      -0.13
Positive Logits
evidence  0.19
/he       0.17
plenty    0.16
kaar      0.15
Evidence  0.15
/read     0.15
ohana     0.14
753       0.14
OMPI      0.14
Unnamed   0.14
Activations Density 0.116%