Explanations
references to harmful or dangerous entities, such as killer whales or people
references to "killer whales."
Negative Logits
ional      -0.91
ational    -0.83
ourced     -0.83
bles       -0.82
é¾         -0.80
yrinth     -0.80
ourcing    -0.78
edu        -0.78
OVER       -0.77
atile      -0.76
Positive Logits
killer     0.93
killer     0.92
killers    0.84
whales     0.82
Killer     0.79
spree      0.77
instinct   0.75
whale      0.73
knife      0.71
Orange     0.71
Activations Density 0.028%