Explanations
contradictory statements
Negative Logits
Token       Logit
ster        -0.66
jee         -0.63
ature       -0.62
por         -0.61
uttering    -0.59
enburg      -0.58
END         -0.57
lining      -0.57
lich        -0.56
favor       -0.55
Positive Logits
Token        Logit
soever       1.36
happens      1.36
happened     1.34
transpired   1.22
constitutes  1.11
else         0.94
happ         0.91
occurs       0.89
unfolds      0.87
mattered     0.86
Activations Density 0.098%