INDEX
Explanations
words related to ethics or moral reasoning
Negative Logits
glers       -0.92
Abyss       -0.74
ERY         -0.70
Cage        -0.67
Leap        -0.66
Coalition   -0.66
Ducks       -0.64
Bruins      -0.64
ggle        -0.63
Gru         -0.61
Positive Logits
utations     1.44
ulsive       1.30
ublic        1.25
rehensible   1.25
rieve        1.23
roach        1.20
orters       1.20
rint         1.19
ressed       1.17
uted         1.17
Activations Density: 0.012%