INDEX
Explanations
expressions of personal experience and introspection
Negative Logits
Token         Logit
seen          -0.16
barley        -0.15
bout          -0.15
elige         -0.14
ounge         -0.14
ньо           -0.14
utin          -0.14
ognition      -0.14
apg           -0.14
Enlight       -0.14
Positive Logits
Token         Logit
suspect       0.21
dim           0.18
Sus           0.17
annis         0.17
worry         0.17
increasingly  0.17
privilege     0.17
lux           0.17
habit         0.16
contextual    0.16
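Dashboards like this one commonly derive the positive and negative logit lists by projecting the feature's decoder direction onto each token's unembedding vector and keeping the largest and smallest results. The sketch below assumes that convention; the real pipeline may differ (for example in normalization). The vectors are invented toy values, not this feature's weights, and the function names `logit_effects` and `top_and_bottom` are hypothetical.

```python
def logit_effects(decoder_dir, unembed):
    """Dot product of a feature's decoder direction with each token's unembedding.

    decoder_dir: list[float] of length d_model (toy values, assumed).
    unembed: dict mapping token -> list[float] of length d_model (toy values, assumed).
    Returns a dict mapping token -> logit effect.
    """
    return {
        tok: sum(a * b for a, b in zip(decoder_dir, vec))
        for tok, vec in unembed.items()
    }


def top_and_bottom(effects, k):
    """Return (k most positive, k most negative) (token, value) pairs."""
    ranked = sorted(effects.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:k], ranked[::-1][:k]


# Hypothetical 2-dimensional example using three tokens from the tables above.
decoder_dir = [1.0, -0.5]
unembed = {
    "suspect": [0.2, 0.0],
    "worry": [0.1, -0.1],
    "seen": [-0.1, 0.1],
}
effects = logit_effects(decoder_dir, unembed)
positive, negative = top_and_bottom(effects, k=1)
```

With these toy vectors, "suspect" comes out as the most positive token and "seen" as the most negative, mirroring the heads of the two lists.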
Activation Density: 0.420%
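Activation density is typically reported as the share of sampled tokens on which the feature fires. A density of 0.420% therefore corresponds to about 4.2 tokens in every 1,000. The sketch below assumes "fires" means an activation strictly above a threshold of 0. The function name `activation_density` and the sample data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose activation exceeds `threshold` (assumed definition)."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical sample: the feature fires on 42 of 10,000 tokens.
sample = [1.0] * 42 + [0.0] * 9958
density = activation_density(sample)  # 0.42 (percent)
```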