INDEX
Explanations
sentences that critique societal norms and behaviors, particularly regarding humor and double standards in media
Negative Logits
.�            -0.77
----------    -0.76
.''           -0.72
,''           -0.72
Copyright     -0.70
tion          -0.67
.}            -0.67
`.            -0.66
properties    -0.65
|             -0.64
Positive Logits
haunted       0.72
inexpl        0.70
cannibal      0.69
coughing      0.69
scor          0.68
endlessly     0.68
sleek         0.68
humming       0.68
glowing       0.68
ooz           0.67
Activations Density 0.433%