INDEX
Explanations
sentences expressing concerns or fears about specific topics
Negative Logits
Reloaded          -0.75
Replacement       -0.70
arat              -0.66
reversible        -0.66
Emin              -0.64
apiece            -0.64
Ships             -0.63
Repeat            -0.62
replacements      -0.60
Stainless         -0.59
Positive Logits
privileged         0.91
instinctively      0.91
constantly         0.86
intimately         0.85
regularly          0.84
naturally          0.83
often              0.82
fascinated         0.81
routinely          0.81
responsibilities   0.79
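The two lists above rank tokens by their logit score for this feature: the lowest scores appear under Negative Logits and the highest under Positive Logits. The sketch below assumes a precomputed mapping from token to score, because the source does not say how the scores are derived. It then picks the k most negative and k most positive entries. The function name `top_and_bottom_logits` is hypothetical, and the scores used in the example are a few of the values listed above, not a full vocabulary.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, float]


def top_and_bottom_logits(scores: Dict[str, float], k: int) -> Tuple[List[Pair], List[Pair]]:
    """Hypothetical helper: return the k most positive and k most negative (token, score) pairs.

    Assumes `scores` already maps each token to its logit score.
    Python's sort is stable (also with reverse=True), so tied scores keep
    their insertion order.
    """
    positive = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
    negative = sorted(scores.items(), key=lambda item: item[1])[:k]
    return positive, negative


# A few of the scores shown in the dashboard lists above.
scores = {
    "Reloaded": -0.75,
    "Replacement": -0.70,
    "arat": -0.66,
    "privileged": 0.91,
    "instinctively": 0.91,
    "constantly": 0.86,
}
pos, neg = top_and_bottom_logits(scores, 2)
print(pos)  # [('privileged', 0.91), ('instinctively', 0.91)]
print(neg)  # [('Reloaded', -0.75), ('Replacement', -0.7)]
```

Sorting the full mapping is fine for a sketch. For a real vocabulary of tens of thousands of tokens, `heapq.nlargest` and `heapq.nsmallest` avoid a full sort.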
Activation Density: 0.391%
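An activation density of 0.391% means the feature fires on roughly 4 out of every 1,000 tokens in whatever sample the dashboard measured. The minimal sketch below assumes density is the percentage of tokens whose activation is strictly above zero. The source does not state the sample or the threshold. The helper `activation_density` and its `threshold` parameter are hypothetical.

```python
from typing import Sequence


def activation_density(activations: Sequence[float], threshold: float = 0.0) -> float:
    """Hypothetical helper: percentage of tokens whose activation exceeds `threshold`.

    Assumes "active" means strictly greater than the threshold (0.0 by default).
    """
    if len(activations) == 0:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Illustrative data only: 391 active tokens out of 100,000.
acts = [0.0] * 100_000
for i in range(391):
    acts[i] = 1.0
density = activation_density(acts)
print(f"{density:.3f}%")  # 0.391%
```
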