INDEX
Explanations
words related to concepts of control, influence, and manipulation
words and patterns related to confirmation or agreement
Negative Logits
Token     Value
chev      -0.85
itsch     -0.69
abouts    -0.64
\/\/      -0.64
WARD      -0.61
itta      -0.61
Uriel     -0.60
labels    -0.58
tarians   -0.58
agara     -0.58
Positive Logits
Token     Value
ciating   0.87
ctions    0.83
enment    0.82
ctory     0.81
uration   0.80
rences    0.78
nces      0.76
ruction   0.75
rency     0.75
ption     0.74
Activations Density: 0.065%