Explanations
references to spoilers in discussions about TV shows or movies
Negative Logits
³            -0.14
apult        -0.14
ffa          -0.14
.espresso    -0.13
egas         -0.13
shoot        -0.13
enberg       -0.13
çĭ           -0.13
rians        -0.13
flakes       -0.12
Positive Logits
Spo          0.67
spoil        0.63
spoiler      0.60
spo          0.60
spoilers     0.60
Spo          0.59
spo          0.59
spoiled      0.52
Spoiler      0.45
spol         0.37
Activations Density: 0.048%