Explanations
terms related to harmful or dangerous individuals or entities
mentions of "killer" in various contexts, which likely indicates a focus on terms associated with dangerous entities or situations
Negative Logits
Token          Logit
rity           -0.89
bles           -0.80
urn            -0.77
ational        -0.77
ional          -0.77
edu            -0.74
ured           -0.72
ratulations    -0.71
rir            -0.70
urat           -0.69
Positive Logits
Token          Logit
killer         1.14
killer         1.05
killers        0.97
Killer         0.91
spree          0.84
whales         0.82
whale          0.76
knife          0.75
beware         0.72
fish           0.71
Activations Density: 0.013%