Explanations
mentions of inconsistencies or discrepancies in reasoning or arguments
Negative Logits
makeText         -0.64
Netz             -0.63
@[+][            -0.61
estens           -0.58
Dienst           -0.57
Erde             -0.57
voyez            -0.57
sizeCache        -0.57
ruzzo            -0.57
SerializedSize   -0.57
Positive Logits
discrepancy      0.99
contradictions   0.94
Incon            0.93
discrepancies    0.92
contradiction    0.85
Discre           0.81
Zeneca           0.80
contradictory    0.79
discre           0.78
contradic        0.75
Activations Density: 0.024%