INDEX
Explanations
statements related to belief or trust
phrases that express belief or trust
Negative Logits
imilar    -0.77
aida      -0.71
idian     -0.70
sidel     -0.66
entin     -0.66
undy      -0.66
inav      -0.65
imer      -0.65
ertation  -0.65
ija       -0.63
Positive Logits
when        0.72
WHEN        0.65
Yourself    0.61
hype        0.60
yourselves  0.57
unless      0.57
expr        0.56
whenever    0.56
eminent     0.56
}}          0.55
Activations Density 0.069%