INDEX
Explanations
references to spoilers in a discussion about TV shows
Negative Logits
istik      -0.15
enberg     -0.14
çĭ         -0.14
ffa        -0.14
consect    -0.14
flakes     -0.14
intro      -0.14
èµ         -0.13
aptcha     -0.13
apult      -0.13
Positive Logits
Spo        0.59
spoilers   0.55
spoiler    0.53
spoil      0.53
spo        0.52
spo        0.50
Spo        0.50
spoiled    0.45
Spoiler    0.39
spol       0.31
Activations Density: 0.036%