INDEX
Explanations
references to humor and satire
Negative Logits
ports               -0.88
ignty               -0.80
hips                -0.72
enfranch            -0.69
cryst               -0.68
¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯    -0.67
Components          -0.66
eded                -0.66
uchs                -0.65
yer                 -0.65
Positive Logits
ously               1.08
jokes               0.95
mocking             0.90
humour              0.86
parody              0.86
osity               0.85
joking              0.84
satir               0.84
joke                0.83
humor               0.83
Activations Density: 1.212%