INDEX
Explanations
numerical measurements such as scores, statistics, or other quantifiable data
Negative Logits
atem         -0.55
Uriel        -0.55
Sorceress    -0.54
Aly          -0.53
£ı           -0.53
retweet      -0.53
Jagu         -0.53
DRAG         -0.51
Lara         -0.51
Levant       -0.51
Positive Logits
th           1.43
76           1.27
71           1.24
56           1.21
87           1.21
73           1.20
61           1.20
72           1.20
34           1.20
54           1.19
Activations Density 1.377%