Explanations
assertive statements or judgments
phrases that express safety and certainty
Negative Logits
tons             -0.71
ĸļ               -0.64
listed           -0.63
Saving           -0.63
mattered         -0.62
millenn          -0.58
ories            -0.57
objectionable    -0.57
IMAGES           -0.57
pieces           -0.57
POSITIVE LOGITS
assume           1.39
conclude         1.24
speculate        1.15
say              1.07
presume          1.05
expect           0.99
suggest          0.99
criticize        0.98
argue            0.98
ask              0.94
Activations Density: 0.071%