Explanations
words related to suppression or control of information or actions
terms related to suppression or repression
Negative Logits
sen       -0.83
Journey   -0.73
rians     -0.72
deals     -0.71
ser       -0.70
Step      -0.69
ubuntu    -0.69
Sky       -0.69
PART      -0.69
ature     -0.69
Positive Logits
suppress      1.27
suppression   1.17
suppressing   1.16
suppressed    1.08
muzzle        0.87
ively         0.83
encing        0.75
inhib         0.70
ences         0.70
apparatus     0.67
Activations Density: 0.012%