INDEX
Explanations
spoiler warnings in texts
references to spoilers in the text
Negative Logits
llan          -0.73
trak          -0.72
orney         -0.71
undai         -0.70
kes           -0.69
Architects    -0.68
Palest        -0.67
kos           -0.67
Motion        -0.67
urat          -0.66
Positive Logits
spoiler       0.95
spoilers      0.94
OIL           0.92
OUS           0.84
spoil         0.82
ervative      0.75
spo           0.74
ific          0.73
alert         0.73
pedia         0.72
Activations Density: 0.077%