INDEX
Explanations
phrases indicating scientific agreement or conclusions in research papers
Negative Logits
principalColumn  -0.72
Zunanje          -0.70
ecutive          -0.67
jectures         -0.66
MainAxisSize     -0.65
كومونز           -0.62
Tembelea         -0.60
.~(\             -0.60
BoxShadow        -0.60
Paglinawan       -0.60
Positive Logits
↵↵              0.92
<eos>           0.74
↵↵↵             0.71
↵↵↵↵            0.68
The             0.60
↵↵↵↵↵↵↵↵↵↵↵↵↵↵  0.60
↵↵↵↵↵↵          0.58
↵↵↵↵↵           0.57
↵               0.54
↵↵↵↵↵↵↵↵↵       0.53
Activations Density 0.578%