Explanations
assertive statements regarding beliefs or opinions
Negative Logits
Token     Logit
ambi      -0.20
æ¥        -0.16
rist      -0.16
ê¸ī       -0.15
æ£        -0.15
arp       -0.14
à¹Ģ       -0.14
urrent    -0.14
uy        -0.14
946       -0.14
Positive Logits
Token     Logit
oldt      0.17
PIO       0.16
Vine      0.15
iences    0.15
intr      0.15
ansen     0.15
rak       0.15
intra     0.14
endid     0.14
bjerg     0.14
Activations Density 0.128%