INDEX
Explanations
words related to physical harm or attack
instances of certain names and terms related to entities or characters
Negative Logits
omen            -0.80
athlon          -0.76
ldon            -0.75
ples            -0.75
handshake       -0.73
etermination    -0.73
ally            -0.73
emet            -0.72
icrobial        -0.72
EStreamFrame    -0.72
Positive Logits
Canaver         0.82
Beng            0.74
glers           0.74
Sebast          0.74
ABE             0.73
Kuala           0.70
Sebastian       0.69
Pengu           0.65
Footnote        0.65
Ops             0.64
Activations Density 0.033%