INDEX
Explanations
phrases related to knowledge, beliefs, and actions taken by individuals or groups
statements of knowledge or claims about various subjects
Negative Logits
Token     Logit
itaire    -0.59
eur       -0.55
earcher   -0.55
aml       -0.53
advoc     -0.52
icter     -0.52
asus      -0.51
pex       -0.51
ogl       -0.50
Pass      -0.50
Positive Logits
Token        Logit
themselves   1.12
selves       0.90
selves       0.89
THEIR        0.65
their        0.64
MpServer     0.61
helmets      0.61
jointly      0.60
li           0.60
asses        0.59
Activations Density: 0.859%