INDEX
Explanations
the word "pit" followed by a high activation number
repeated mentions of "pit bull."
Negative Logits
Token           Logit
サーティワン    -0.71
Carbuncle       -0.69
Lauder          -0.66
challeng        -0.65
proport         -0.64
IGH             -0.63
Feinstein       -0.62
lihood          -0.61
Polo            -0.60
Leilan          -0.59
Positive Logits
Token           Logit
iful            1.35
cair            1.28
ifully          1.27
iless           1.24
cher            1.08
bull            0.94
uit             0.90
iable           0.88
adium           0.87
reatment        0.86
Activations Density: 0.039%