INDEX
Explanations
words related to significant actions or events
references to various acts of wrongdoing or violence
Negative Logits
sshd          -0.79
Flavoring     -0.78
corners       -0.73
ceilings      -0.70
Challenges    -0.67
kees          -0.65
ials          -0.65
ernels        -0.65
strands       -0.65
Generations   -0.64
Positive Logits
sabotage      1.00
kindness      0.94
vandalism     0.87
EVA           0.83
heroism       0.83
desperation   0.80
aggression    0.77
luck          0.77
piracy        0.72
violence      0.70
Activations Density 0.048%