INDEX
Explanations
mentions of specific formatting elements within a text, such as advertisement separators or story continuations
instances of advertisement or promotional content
Negative Logits
homebrew     -0.73
addin        -0.65
EStream      -0.64
romy         -0.62
leans        -0.61
adjunct      -0.60
boro         -0.60
ktop         -0.59
estranged    -0.56
issance      -0.54
Positive Logits
ccording     0.84
JUST         0.78
SPONSORED    0.72
Spain        0.71
ATT          0.67
AIN          0.67
eria         0.67
TAG          0.66
Reward       0.65
Prev         0.65
Activations Density 0.073%