Explanations
verbs related to negative actions or consequences
verbs and phrases associated with actions that cause harm or influence outcomes
Negative Logits
Token         Logit
cerning       -0.67
seek          -0.67
"},           -0.65
spons         -0.62
rug           -0.61
FORE          -0.61
reserved      -0.60
ny            -0.59
nings         -0.58
considering   -0.58
Positive Logits
Token         Logit
ively         1.04
herself       0.83
himself       0.83
ibly          0.82
yourselves    0.80
ingly         0.73
yourself      0.72
themselves    0.71
auga          0.71
urally        0.68
Activations Density 0.480%