Explanations
terms related to clear and unambiguous actions or statements
instances of the word "outright" indicating a strong affirmation or assertion
Negative Logits
ulton       -0.78
agine       -0.78
ĺħ          -0.75
nan         -0.72
arts        -0.71
anners      -0.71
anwhile     -0.71
ramid       -0.67
Neighbor    -0.67
nesota      -0.65
Positive Logits
Introduced  0.76
ãĤ¦ãĤ¹      0.75
hostility   0.73
shown       0.71
eless       0.70
Discuss     0.69
iary        0.68
itarian     0.68
refusal     0.66
ãĥ          0.65
Activations Density 0.015%
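
Some tokens in the logit lists above (ĺħ, ãĤ¦ãĤ¹, ãĥ) look garbled. They may simply be byte-level BPE vocabulary strings rather than encoding errors. The sketch below is a minimal illustration that assumes the model uses the GPT-2 style byte-to-character table, which the page itself does not state. The helper name `decode_token` is hypothetical and is not part of any dashboard API.

```python
def bytes_to_unicode():
    """Build the byte -> printable-character table used by GPT-2 style
    byte-level BPE (assumed here; the source page does not name the tokenizer)."""
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points 256 and above.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


# Invert the table: vocabulary character -> original byte value.
CHAR_TO_BYTE = {ch: b for b, ch in bytes_to_unicode().items()}


def decode_token(token):
    """Hypothetical helper: map a vocabulary string back to text.
    Incomplete UTF-8 sequences are shown as U+FFFD replacement characters."""
    raw = bytes(CHAR_TO_BYTE[ch] for ch in token)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    for tok in ["ãĤ¦ãĤ¹", "ãĥ", "ĺħ", "hostility"]:
        print(repr(tok), "->", repr(decode_token(tok)))
```

Under that assumption, ãĤ¦ãĤ¹ decodes to the katakana pair ウス. ãĥ and ĺħ are fragments of multi-byte characters, so they do not decode to complete text on their own.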