INDEX
Explanations
jokes and humor-related content
Negative Logits
rrggbb        -0.60
Whitby        -0.57
alsey         -0.53
}>;           -0.49
Maritime      -0.46
CPO           -0.44
Clearwater    -0.44
Baran         -0.44
vallis        -0.44
guous         -0.43
Positive Logits
joke          1.00
Joke          0.99
joke          0.96
Joke          0.96
jokes         0.91
Jokes         0.89
broma         0.85
Jokes         0.83
joking        0.79
jokes         0.77
Activations Density: 0.003%