INDEX
Explanations
questions and challenges regarding evidence or claims made
Negative Logits
']}        -0.82
"]}        -0.69
]}$        -0.68
'}>        -0.68
estekak    -0.67
متعلقه     -0.66
"])        -0.65
)]         -0.63
]          -0.63
")}        -0.62
Positive Logits
disagree   0.66
rebuttal   0.65
disprove   0.64
monger     0.61
Comparing  0.59
facts      0.59
refute     0.58
argument   0.57
arguments  0.56
judge      0.56
Activations Density 0.546%