INDEX
Explanations
Sentences expressing opinions, reactions, or critiques
NEGATIVE LOGITS
vid             -0.17
ire             -0.15
вид             -0.14
Vid             -0.14
PTS             -0.14
pil             -0.14
elda            -0.14
PT              -0.14
others          -0.13
osh             -0.13
POSITIVE LOGITS
ýt               0.16
iscrimination    0.16
iciary           0.16
_HERE            0.16
PROFITS          0.15
,application     0.14
ioned            0.14
mpar             0.14
]={↵             0.14
.sg              0.14
Activation density: 0.395%
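The page does not define activation density. A common reading is the percentage of token positions in the sampled dataset where the feature's activation is above zero. Under that assumption, 0.395% means the feature fires on roughly 4 of every 1,000 tokens. The sketch below assumes that definition; the function name and toy data are hypothetical.

```python
# A minimal sketch, ASSUMING activation density = percentage of token
# positions where the feature's activation exceeds a threshold (default 0).
# The function name and the toy data are hypothetical.


def activation_density(activations, threshold=0.0):
    """Percentage of token positions where the activation exceeds threshold."""
    if not activations:
        raise ValueError("need at least one activation")
    active = sum(1 for a in activations if a > threshold)
    return 100.0 * active / len(activations)


# Hypothetical toy data: 4 of 1,000 token positions fire.
toy_activations = [0.0] * 996 + [1.2, 0.3, 0.8, 2.0]
print(f"{activation_density(toy_activations):.3f}%")  # prints "0.400%"
```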