INDEX
Explanations
mentions of the word "popularity" with varying degrees of emphasis
references to the concept of popularity
Negative Logits
erm         -0.77
INAL        -0.73
uran        -0.73
Dull        -0.66
intest      -0.66
Shell       -0.65
Fury        -0.63
Neurolog    -0.61
inis        -0.61
¯¯¯¯        -0.59
Positive Logits
ately       0.97
ability     0.90
ously       0.86
Reviewer    0.81
quo         0.76
itious      0.75
itism       0.73
rise        0.73
uation      0.71
acy         0.71
Activations Density 0.038%