INDEX
Explanations
Significant keywords or phrases related to decision-making and societal impact
Nouns related to abstract concepts and specific topics
Negative Logits
PLA         -0.59
unden       -0.58
ilogy       -0.57
Bah         -0.56
anus        -0.55
noon        -0.55
laughs      -0.55
sole        -0.55
senal       -0.55
Fighters    -0.53
Positive Logits
mattered     0.97
becomes      0.89
cannot       0.86
flowed       0.84
mith         0.82
must         0.82
evolves      0.79
magically    0.78
matter       0.77
ceases       0.77
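On SAE feature dashboards like this one, the positive and negative logit lists are commonly produced in a specific way. The feature's decoder direction is projected through the model's unembedding matrix, and the resulting per-token scores are sorted. The sketch below assumes that convention. The function names, the two-dimensional toy model, and all numeric values are hypothetical. They are not the data behind the lists above.

```python
def logit_effects(decoder_dir, unembed, vocab):
    """Score every vocabulary token for one feature (assumed convention).

    decoder_dir: d_model floats, the feature's decoder row (hypothetical input)
    unembed: d_model x vocab_size nested list, the model's unembedding W_U
    vocab: vocab_size token strings
    """
    d_model = len(decoder_dir)
    scores = []
    for j, token in enumerate(vocab):
        # Dot product of the decoder direction with column j of W_U.
        s = sum(decoder_dir[i] * unembed[i][j] for i in range(d_model))
        scores.append((token, s))
    return scores


def top_and_bottom(scores, k):
    """Return the k highest-scoring (positive) and k lowest-scoring (negative) tokens."""
    ranked = sorted(scores, key=lambda pair: pair[1])  # ascending by score
    negative = ranked[:k]
    positive = list(reversed(ranked[-k:]))  # largest first
    return positive, negative


# Hypothetical toy values, not taken from the dashboard above.
vocab = ["mattered", "becomes", "PLA", "unden"]
decoder_dir = [1.0, 0.5]
unembed = [
    [0.6, 0.5, -0.3, -0.2],
    [0.4, 0.2, -0.2, -0.6],
]
scores = logit_effects(decoder_dir, unembed, vocab)
positive, negative = top_and_bottom(scores, 2)
print([(t, round(s, 2)) for t, s in positive])
print([(t, round(s, 2)) for t, s in negative])
```

Under this convention, a positive score means the feature pushes the model toward predicting that token, and a negative score means it pushes away from it.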
Activations Density 0.479%
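Activations density is usually reported as the percentage of sampled tokens on which the feature's activation is nonzero. A minimal sketch follows. It assumes the activations are available as a flat list and that "firing" means exceeding a threshold of 0. The function name and the sample data are hypothetical.

```python
def activation_density(activations, threshold=0.0):
    """Percentage of tokens whose feature activation exceeds `threshold`."""
    if not activations:
        raise ValueError("need at least one activation")
    firing = sum(1 for a in activations if a > threshold)
    return 100.0 * firing / len(activations)


# Hypothetical sample: the feature fires on 3 of 1,000 tokens.
sample = [0.0] * 997 + [1.2, 0.4, 2.5]
print(f"{activation_density(sample):.3f}%")  # 0.300%
```

Read this way, the 0.479% reported above means the feature is active on roughly 5 of every 1,000 tokens.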