INDEX
Explanations
names of individuals
references to specific individuals and moral implications
Negative Logits
nets       -0.86
sonian     -0.86
pack       -0.76
liners     -0.75
jri        -0.74
unders     -0.73
gers       -0.73
acement    -0.70
lets       -0.69
enegger    -0.69
Positive Logits
terday     0.75
surv       0.73
hyde       0.72
HAEL       0.71
utical     0.67
ajor       0.67
ouched     0.66
VICE       0.64
Ba         0.64
wrestle    0.62
Activations Density: 0.028%