INDEX
Explanations
phrases related to assertions or claims
Negative Logits
aný         -0.15
gave        -0.15
راÙĩ        -0.14
Reviewed    -0.14
egot        -0.14
empor       -0.14
rve         -0.14
ming        -0.13
rina        -0.13
åΰäºĨ       -0.13
Positive Logits
widely       0.26
learned      0.26
established  0.25
reported     0.24
understood   0.24
believed     0.23
sur          0.23
known        0.23
bel          0.23
wid          0.22
Activations Density: 0.105%