INDEX
Explanations
questions and introductory phrases that signal explanations or observations
Head Attr Weights
0:0.04
1:0.01
2:0.07
3:0.05
4:0.03
5:0.11
6:0.02
7:0.03
8:0.41
9:0.03
10:0.08
11:0.05
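
The head attribution weights above can be summarized with a few lines of arithmetic. The sketch below is illustrative only: the dictionary name `head_attr` is hypothetical, the values are copied from the listing, and the "share" figure assumes the weights are comparable on a common scale, which the page does not state. The listed weights sum to 0.93, so they are not assumed to be normalized.

```python
# Head attribution weights copied from the listing above (head index -> weight).
# The name `head_attr` is a hypothetical label used only for this sketch.
head_attr = {
    0: 0.04, 1: 0.01, 2: 0.07, 3: 0.05,
    4: 0.03, 5: 0.11, 6: 0.02, 7: 0.03,
    8: 0.41, 9: 0.03, 10: 0.08, 11: 0.05,
}

# Head with the largest attribution weight.
top_head = max(head_attr, key=head_attr.get)

# Sum of the listed weights, and the top head's share of that sum.
total = sum(head_attr.values())
share = head_attr[top_head] / total

print(f"top head: {top_head} (weight {head_attr[top_head]:.2f})")
print(f"sum of listed weights: {total:.2f}")
print(f"top head share of listed total: {share:.1%}")
```

On these values, head 8 carries the largest weight, well ahead of the next largest (head 5 at 0.11).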
Negative Logits
Token    Logit
Kear     -2.01
acan     -1.71
venge    -1.70
prop     -1.66
yg       -1.62
yp       -1.58
wra      -1.54
aer      -1.50
bec      -1.46
arthed   -1.42
Positive Logits
Token            Logit
ALSE             1.90
ulla             1.87
ModLoader        1.79
soDeliveryDate   1.67
earances         1.61
affirmative      1.61
nexus            1.60
microsoft        1.59
IDA              1.54
odka             1.54
Activations Density 0.073%