Explanations
references to specific individuals, particularly public figures and their actions
Negative Logits
points      -0.77
binary      -0.69
cape        -0.69
isted       -0.69
Franç       -0.68
stice       -0.68
istance     -0.68
opausal     -0.67
ãĥĦ         -0.67
itarian     -0.67
Positive Logits
McCabe      0.96
hew         0.86
hews        0.73
atcher      0.73
plot        0.71
onduct      0.70
shaw        0.67
ursed       0.66
20439       0.65
abouts      0.64
Activations Density 0.005%