INDEX
Explanations
phrases related to moral and ethical considerations
Negative Logits
gdala       -0.70
Zip         -0.62
Rost        -0.61
Mamm        -0.60
Democr      -0.60
izon        -0.60
wave        -0.59
Lars        -0.58
illusion    -0.55
Guilty      -0.54
Positive Logits
attention    0.96
lessly       0.89
scrutiny     0.82
ENTION       0.78
FINE         0.77
tweaking     0.76
updating     0.74
Attention    0.73
correction   0.72
repairs      0.71
Activations Density 0.100%