INDEX
Explanations
words expressing strong emotions or preferences towards certain actions or entities
expressions of affection and aversion
Negative Logits
Token               Logit
phabet              -0.81
EStreamFrame        -0.77
rome                -0.75
externalActionCode  -0.72
harm                -0.69
soType              -0.66
nw                  -0.65
level               -0.65
levels              -0.65
aqu                 -0.64
Positive Logits
Token               Logit
themselves          0.67
Pigs                0.61
revenge             0.59
foreigners          0.59
passionately        0.59
sticking            0.58
eagerly             0.58
outsiders           0.58
importing           0.56
advertising         0.56
Activations Density: 0.319%