INDEX
Explanations
expressions prompting personal reflection or self-assessment
Negative Logits
alth       -0.17
Sez        -0.16
ismet      -0.15
ledon      -0.15
hare       -0.15
konk       -0.14
-meta      -0.14
ropolis    -0.14
edback     -0.14
bles       -0.14
Positive Logits
whether    0.17
agate      0.17
zew        0.14
482        0.14
Whether    0.14
gra        0.14
leur       0.14
gis        0.14
chor       0.14
orer       0.13
Activations Density: 0.254%