INDEX
Explanations
references to societal issues
references to societal issues and conditions
Negative Logits
Token      Logit
urations   -0.86
Pad        -0.75
word       -0.71
rav        -0.67
iverse     -0.64
ruction    -0.63
imon       -0.62
etsk       -0.62
DEN        -0.62
Verb       -0.61
Positive Logits
Token      Logit
wide       0.88
eers       0.81
ically     0.75
eering     0.75
liness     0.74
fare       0.73
folk       0.72
geist      0.72
evolves    0.70
indo       0.68
Activations Density: 0.020%