INDEX
Explanations
references to names or specific nouns
mentions of individuals and episodes from certain series or events
Negative Logits
oat         -0.93
bing        -0.85
ret         -0.77
oths        -0.75
reth        -0.72
Telegram    -0.68
bor         -0.67
rosse       -0.67
gest        -0.66
bage        -0.64
Positive Logits
elson       0.79
letal       0.79
umat        0.77
opoulos     0.77
ewski       0.75
cart        0.74
ansen       0.74
umatic      0.73
emic        0.73
aimon       0.73
Activation Density: 0.045%