INDEX
Explanations
phrases related to a strong degree of certainty or emphasis
NEGATIVE LOGITS
Token         Logit
atures        -1.20
yne           -1.16
neapolis      -1.07
imeters       -0.99
eways         -0.95
GW            -0.95
onut          -0.94
İĭ            -0.93
itures        -0.91
Correction    -0.91
POSITIVE LOGITS
Token               Logit
appreciated         1.28
obliged             1.13
appreci             1.06
resemble            0.99
distinguish         0.97
interested          0.96
welcome             0.95
influenced          0.95
regarded            0.94
________________    0.94
Activations Density: 0.507%
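
As a rough illustration of how a density figure of this kind could be computed, the sketch below assumes that activation density means the percentage of token positions on which the feature's activation is above zero. That definition, the function name `activation_density`, the `threshold` parameter, and the sample data are all assumptions made for illustration; none of them come from this page.

```python
def activation_density(activations, threshold=0.0):
    """Return the percentage of token positions whose activation exceeds threshold.

    Assumed definition: density = 100 * (active positions) / (all positions).
    """
    if not activations:
        raise ValueError("activations must be non-empty")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical data: 1000 token positions, 5 of which activate the feature.
sample = [0.0] * 995 + [0.3, 1.2, 0.7, 2.1, 0.05]
density = activation_density(sample)
print(f"{density:.3f}%")  # prints 0.500%
```

Under this reading, a density of 0.507% would correspond to the feature firing on roughly 5 of every 1000 tokens.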