Explanations
parodies and satirical content in text
instances of parody and satire in the text
Negative Logits
erto              -0.78
oard              -0.75
Streamer          -0.75
violet            -0.75
vals              -0.72
negie             -0.72
rain              -0.71
Va                -0.71
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯  -0.70
Transfer          -0.70
Positive Logits
satir             1.25
satire            1.25
spoof             1.21
parody            1.20
mockery           1.04
mocking           1.04
satirical         1.02
caric             1.00
caricature        0.91
netflix           0.88
Activations Density 0.015%