Explanations
expressions related to personal opinions or viewpoints
Negative Logits
Token        Logit
orian        -0.17
eri          -0.16
uras         -0.16
eree         -0.15
leta         -0.15
lsi          -0.15
issen        -0.14
lier         -0.14
ifestyles    -0.14
ey           -0.13
Positive Logits
Token        Logit
ated         0.32
ATED         0.24
POSITE       0.22
aires        0.20
inions       0.20
naire        0.19
ative        0.19
formation    0.18
ating        0.18
naires       0.18
Activations Density: 0.012%