INDEX
Explanations
words related to policies and their impact
phrases detailing detrimental policies and their impacts
Negative Logits
Token         Logit
odiac         -0.75
rete          -0.71
raq           -0.70
Sync          -0.69
Beaver        -0.69
reciation     -0.68
Pixel         -0.65
Registered    -0.62
atha          -0.62
Feather       -0.61
Positive Logits
Token         Logit
harms         1.53
exacerbate    1.51
undermine     1.49
undermines    1.48
exacerb       1.45
impover       1.41
devast        1.36
undermined    1.35
jeopard       1.35
adversely     1.33
Activations Density 0.467%