INDEX
Explanations
statements or claims of fact
words related to factual statements or claims
Negative Logits
İĭ           -0.80
throats      -0.76
gian         -0.69
veins        -0.69
charcoal     -0.68
Hutchinson   -0.68
haw          -0.67
eyebrows     -0.66
DER          -0.66
Bhar         -0.65
Positive Logits
Fact         1.06
ional        0.87
Fact         0.83
fact         0.83
facts        0.81
itious       0.80
ual          0.78
ulence       0.77
onom         0.77
finding      0.75
Activations Density: 0.008%