INDEX
Explanations
phrases related to criticism or disapproval
instances of criticism or accountability in discussions
Negative Logits
arnaev        -0.72
nel           -0.72
CrossRef      -0.70
Starship      -0.69
Sutherland    -0.69
aea           -0.66
Chung         -0.66
nosis         -0.65
usa           -0.65
Translation   -0.64
Positive Logits
superiority       0.98
blasphemy         0.83
failures          0.83
unfair            0.81
injust            0.80
daring            0.80
frivolous         0.79
piety             0.77
accomplishments   0.76
inaction          0.76
Activations Density: 0.606%