A psychologist's thoughts on how and why we play games

Friday, November 28, 2014

Exciting New Misapplications of The New Statistics

This year's increased attention to effect sizes and confidence intervals (ESCI) has been great for psychological science. ESCI offers a number of improvements over null-hypothesis significance testing (NHST), such as an attention to practical significance and the elimination of dichotomous decision rules.

However, the problem with ESCI is that it is purely descriptive, not inferential. No hypotheses are named, and so ESCI doesn't report on the probability of a hypothesis given the data, or even the probability of the data given a null hypothesis. No process or statistic turns the ESCI into a decision, although we might make Geoff Cumming cringe by looking at whether the ESCI includes zero and making a decision based on that, thereby falling right back to using NHST.

The point is, there's no theoretical or even pragmatic method for turning an ESCI into an inference. At what point does a confidence interval become sufficiently narrow to make a decision? We know that values near the extremes of the interval are often less likely than the values near the middle, but how much less likely?

I'm not asking for a formal dichotomous decision rule (I'm a Bayesian, I have resigned my life to uncertainty), but I've already noticed the ways we can apply ESCI inconsistently to overstate the evidence. See a recent example from Boothby, Clark, and Bargh (PDF link), arguing that shared experiences are more intense in two studies of n = 23 women:
Indeed, our analyses indicated that participants liked the chocolate significantly less when the confederate was also eating the chocolate (M = 2.45, SD = 1.77) than when the confederate was reviewing the computational products (M = 3.16, SD = 2.32), t(21) = 2.42, p = .025, 95% CI for the difference between conditions = [0.10, 1.31], Cohen’s d = 0.34. Participants reported feeling more absorbed in the experience of eating the chocolate in the shared-experience condition (M = 6.11, SD = 2.27) than in the unshared-experience condition (M = 5.39, SD = 2.43), p = .14. Participants also felt like they were more “on the same wavelength” with the confederate during the shared-experience condition (M = 6.43, SD = 1.38) compared with the unshared-experience condition (M = 5.61, SD = 1.38), t(21) = 2.35, p = .03, 95% CI for the difference between conditions = [0.10, 1.54], Cohen’s d = 0.59 (see Fig. 2). There were no significant differences in participants’ self-reported mood or any other feedback measures between the shared and the unshared-experience conditions (all ps > .10).
Normally one wouldn't be allowed to talk about that p = .14 as evidence for an effect, but we now live in a more enlightened ESCI period in which we're trying to get away from dichotomous decision making. Okay, that's great, although I'd question the wisdom of trying to make any inference based on such a small sample, even within-subjects. But notice that when p = .14 is in the direction of their expected effect, it is interpreted as evidence for the phenomenon, but when differences are in a direction that does not support the hypothesis, it is simply reported as "not significant, p > .10". If we're going to abandon NHST for ESCI, we should at least be consistent about reporting ALL the ESCIs, and not just the ones that support our hypotheses.

Or, better yet, use that ESCI to actually make a principled and consistent inference through a Bayes factor. Specify an alternative hypothesis of what the theory might suggest are likely effect sizes. In this example, one might say that the effect size is most likely somewhere between d = 0 and d = 0.5, with smaller values more likely than large values. This would look like the upper half of a normal distribution with mean 0 and standard deviation .5. Then we'd see how probable the obtained effect is given this alternative hypothesis and compare it to how probable the effect would be given the null hypothesis. At 20 subjects, I'm going to guess that the evidence is a little less than 3:1 odds for the alternative for the significant items, and less than that for the other items.
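Here is a minimal sketch in R of the comparison described above, not a definitive analysis. It assumes the observed effect is approximately normally distributed around the true effect, takes d = 0.34 and t = 2.42 from the quoted passage, and back-computes a standard error as d / t. The function name bf_half_normal and that standard-error shortcut are illustrative assumptions, and the answer will move with either choice.

```r
# Bayes factor for H1 (half-normal prior on the effect) against H0 (effect = 0),
# using a normal approximation to the likelihood of the observed effect.
bf_half_normal <- function(obs, se, prior_sd) {
  # Probability density of the observed effect averaged over the half-normal prior
  like_h1 <- integrate(
    function(delta) dnorm(obs, mean = delta, sd = se) * 2 * dnorm(delta, 0, prior_sd),
    lower = 0, upper = Inf
  )$value
  # Probability density of the observed effect if the true effect is exactly zero
  like_h0 <- dnorm(obs, mean = 0, sd = se)
  like_h1 / like_h0
}

obs_d <- 0.34           # reported Cohen's d for the liking item
se_d  <- obs_d / 2.42   # assumed standard error, back-computed from t(21) = 2.42

bf_liking <- bf_half_normal(obs_d, se_d, prior_sd = 0.5)
bf_liking
```

Values above 1 favor the alternative and values below 1 favor the null. Swapping in a different prior_sd is a quick way to see how much the conclusion leans on the stated alternative.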

ESCI's a good first step, but we need to be careful and consistent about how we use it before we send ourselves to a fresh new hell. But when Bayesian analysis is this easy for simple study designs, why stop at ESCI?

Tuesday, July 1, 2014

Can p-curve detect p-hacking through moderator trawling?

NOTE: Dr. Simonsohn has contacted me and indicated a possible error in my algorithm. The results presented here could be invalid. We are talking back and forth and I am trying to fix my code. Stay tuned!

Suppose a researcher were to conduct an experiment looking to see if Manipulation X had any effect on Outcome Y, but the result was not significant. Since nonsignificant results are harder to publish, the researcher might be motivated to find some sort of significant effect somehow. How might the researcher go about dredging up a significant p-value?

One possibility is "moderator trawling". The researcher could try potential moderating variables until one is found that provides a significant interaction. Maybe it only works for men but not women? Maybe the effect can be seen after error trials, but not after successful trials? In essence, this is slicing the data until one manages to find a subset of the data that does show the desired effect. Given the number of psychological findings that seem to depend on surprisingly nuanced moderating conditions (ESP is one of these, but there are others), I do not think this is an uncommon practice.

To demonstrate moderator trawling, here's a set of data that has no main effect.
However, when we slice up the data by one of our moderators, we do find an effect. Here the interaction is significant, and the simple slope in group 0 is also significant.

In the long run, testing the main effect and three moderators will cause the alpha error rate to increase from 5% to 18.5%. That's the chance that at least one of the four tests comes up p < .05 when nothing is there: 1 - (.95)^4 = .185, treating the four tests as independent.

Because I am intensely excited by the prospect of p-curve meta-analysis, I just had to program a simulation to see whether p-curve could detect this moderator trawling in the absence of a real effect. P-curve meta-analysis is a statistical technique which examines the distribution of reported significant p-values. It relies on the property of the p-value that, when the null is true, p is uniformly distributed between 0 and 1. When an effect exists, smaller p-values are more likely than larger p-values, even for small p: p < .01 is more likely than .04<p<.05 for a true effect. Thus, a flat p-curve indicates no effect and possible file-drawering of null findings, while a right-skewed p-curve indicates a true effect. More interesting yet, a left-skewed curve suggests p-hacking -- doing what you need to to achieve p < .05, alpha error be damned.
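To make the shape of the p-curve concrete, here is a small R sketch that simulates two-group t-tests with and without a true effect and bins the significant p-values. The sample size (20 per group), the effect size (d = 0.5), and the helper name sim_p are arbitrary choices for illustration, not anything taken from the p-curve app.

```r
set.seed(1)

# Simulate p-values from independent-samples t-tests with true effect size d
sim_p <- function(n_sims, n, d) {
  replicate(n_sims, t.test(rnorm(n, mean = d), rnorm(n))$p.value)
}

p_null <- sim_p(5000, n = 20, d = 0)     # no effect
p_true <- sim_p(5000, n = 20, d = 0.5)   # a real effect

# Keep only significant results and count them in bins of width .01
bins <- seq(0, 0.05, by = 0.01)
curve_null <- table(cut(p_null[p_null < .05], bins))
curve_true <- table(cut(p_true[p_true < .05], bins))

rbind(null = curve_null, true_effect = curve_true)
```

The null row should come out roughly flat across the five bins, while the true-effect row should pile up in the lowest bin: that pile-up is the right skew.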

You can find the simulation hosted on Open Science Framework at https://osf.io/ydwef/. This is my first swipe at an algorithm; I'd be happy to hear other suggestions for algorithms and parameters that simulate moderator trawling. Here's what the script does.
1) Create independent x and y variables from a normal distribution
2) Create three moderator variables z1, z2, and z3, of which 10 random subjects make up each of two levels
3) Fit the main effect model y ~ x. If it's statistically significant, stop and report the main effect.
4) If that doesn't come out significant, try z1, z2, and z3 each as moderators (e.g. y ~ x*z1; y ~ x*z2; y ~ x*z3). If one of these is significant, stop and plan to report the interaction.
5) Simonsohn et al. recommend using the p-value of the interaction for an attenuation interaction (e.g. "There's an effect among men that is reduced or eliminated among women"), but the p-values of each of the simple slopes for a crossover interaction (e.g. "This makes men more angry but makes women less angry."). So, determine whether it's an attenuation or crossover interaction.
5a) If just one simple slope is significant, we'll call it an attenuation interaction. There's a significant effect in one group that is eliminated or reduced in the other group. In this case, we report the interaction p-value.
5b) If neither simple slope is significant, or both are significant with coefficients of opposite sign, we'll call it a crossover. Both slopes significant indicates opposite effects, while neither slope significant indicates that the simple slopes aren't strong enough on their own but their opposition is enough to power a significant interaction. In these cases, we'll report both simple slopes' p-values.
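The steps above can be sketched in R roughly as follows. This is a reconstruction from the description, not the script hosted on OSF: the names coef_p and trawl_once are made up, moderators are drawn one at a time inside the loop, simple slopes are fit separately within each level, and the unlisted case of two significant same-sign slopes is lumped in with attenuation.

```r
set.seed(2014)

# p-value for one named coefficient of a fitted lm
coef_p <- function(fit, term) summary(fit)$coefficients[term, "Pr(>|t|)"]

trawl_once <- function(n = 20) {
  dat <- data.frame(x = rnorm(n), y = rnorm(n))    # step 1: no true effect
  p_main <- coef_p(lm(y ~ x, data = dat), "x")     # step 3: main effect
  if (p_main < .05) return(list(type = "main", p = p_main))

  for (k in 1:3) {                                  # steps 2 and 4: z1, z2, z3
    dat$z <- sample(rep(0:1, each = n / 2))         # 10 random subjects per level
    p_int <- coef_p(lm(y ~ x * z, data = dat), "x:z")
    if (p_int >= .05) next

    # step 5: simple slopes within each level of the moderator
    fits <- lapply(0:1, function(g) lm(y ~ x, data = dat[dat$z == g, ]))
    p_slope <- sapply(fits, coef_p, term = "x")
    b_slope <- sapply(fits, function(f) coef(f)[["x"]])
    n_sig <- sum(p_slope < .05)
    crossover <- n_sig == 0 || (n_sig == 2 && prod(b_slope) < 0)
    if (crossover) return(list(type = "crossover", p = p_slope))   # 5b
    return(list(type = "attenuation", p = p_int))                  # 5a
  }
  list(type = "none", p = NA_real_)                 # nothing to report
}

studies <- replicate(1000, trawl_once(), simplify = FALSE)
types <- sapply(studies, `[[`, "type")
table(types)
mean(types != "none")   # share of studies that end up with something to report
```

The last line gives a quick check against the 18.5% figure: it is the proportion of null studies in which the main effect or one of the three interactions reached p < .05.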

We repeat this for 10,000 hypothetical studies, export the t-tests, and put them into the p-curve app at www.p-curve.com. Can p-curve tell that these results are the product of p-hacking?

It cannot. In the limit, it seems that p-curve will conclude that the findings are very mildly informative, indicating that the p-curve is flatter than 33% power, but still right-skewed, suggesting a true effect measured at about 20% power. Worse yet, it cannot detect that these p-values come from post-hoc tomfoolery and p-hacking. A few of these sprinkled into a research literature could make an effect seem to bear slightly more evidence, and be less p-hacked, than it really is.

The problem would seem to be that the p-values are aggregated across heterogeneous statistical tests: some tests of the main effect, some tests of this interaction or that interaction. Heterogeneity seems like it would be a serious problem for p-curve analysis in other ways. What happens when the p-values come from a combination of well-powered studies of a true effect and some poorly-powered, p-hacked studies of that same effect? (As best I can tell from the manuscript draft, the resulting p-curve is flat!) How does one meta-analyze across studies of different phenomena or different operationalizations or different models?

I remain excited and optimistic for the future of p-curve meta-analysis as a way to consider the strength of research findings. However, I am concerned by the ambiguities of practice and interpretation in the above case. It would be a shame if these p-hacked interactions were interpreted as evidence of a true effect. For now, I think it best to report the data with and without moderators, preregister analysis plans, and ask researchers to report all study variables. In this way, one can reduce the alpha-inflation and understand how badly the results seem to rely upon moderator trawling.

Tuesday, May 20, 2014

Psychology's Awkward Puberty

There's a theory of typical neural development I remember from my times as a neuroscience student. It goes like this: in the beginning of development, the brain's tissues rapidly grow in size and thickness. General-purpose cells are replaced with more specialized cells. Neurons proliferate, and rich interconnections bind them together.

Around the time of puberty, neurons start dying off and many of those connections are pruned. This isn't a bad thing, and in fact, seems to be good for typical neural development, since there seems to be an association between mental disorder and brains that failed to prune.

In the past half a century, psychological science has managed to publish an astonishing number of connections between concepts. For example, people experience physical warmth as interpersonal warmth, hot temperatures make them see more hostile behavior, eating granola with the ingredients all mixed up makes them more creative than eating granola ingredients separately, and seeing the national flag makes them more conservative. Can all of these fantastic connections be true, important, meaningful? Probably not.

Until now, psychology has been designed for the proliferation of effects. Our most common statistical procedure, null hypothesis significance testing, can only find effects, not prove their absence. Researchers are rewarded for finding effects, not performing good science, and the weirder the effect, the more excited the response. And so, we played the game, finding lots of real effects and lots of other somethings we could believe in, too.

It's now time for us to prune some connections. Psychology doesn't know too little, it knows too much -- so much that we can't tell truth from wishful thinking anymore. Even the most bizarre and sorcerous manipulations still manage to eke out p < .05 often enough to turn up in journals. "Everything correlates at r = .30!" we joke. "Everything! Isn't that funny?" One can't hear the truth, overpowered as it is by the neverending chorus of significance, significance, significance.

This pruning process makes researchers nervous, concerned that the effect which garnered them tenure, grants, and fame will be torn to shreds, leaving them naked and foolish. We must remember that the authors of unreplicable findings didn't necessarily do anything wrong -- even the most scrupulous researcher will get p < .05 one time in 20 in the absence of a true effect. That's how Type I error works. (Although one might still wonder how an effect could enjoy so many conceptual replications within a single lab yet fall to pieces the moment it leaves the lab.)

Today, psychology finally enters puberty. It's bound to be awkward and painful, full of hurt feelings, awkwardness, and embarrassment, but it's a sign we're also gaining a little maturity. Let's look forward to the days ahead, in which we know more through knowing less.

Monday, March 24, 2014

Intuitions about p

Two of my labmates were given a practice assignment for a statistics class. Their assignment was to generate simulated data where there was no relationship between x and y. In R, this is easy, and can be done by the code below: x is just the numbers from 1:20, and y is twenty random pulls from a normal distribution.

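dat = data.frame(x = 1:20, y = rnorm(20))  # x is 1:20, y is twenty random normal pulls
m1 = lm(y ~ x, data=dat)                   # regress y on x; the true slope is zero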

One of my labmates ran the above code, frowned, and asked me where he had gone wrong. His p-value was 0.06 -- "marginally significant"! Was x somehow predicting y? I looked at his code and confirmed that it had been written properly and that there was no relationship between x and y. He frowned again. "Maybe I didn't simulate enough subjects," he said. I assured him this was not the case.

It's a common, flawed intuition among researchers that p-values naturally gravitate towards 1 with increasing power or smaller (more nonexistent?) effects. This is an understandable fallacy. As sample size increases, power increases, reducing the Type II error rate. It might be mistakenly assumed, then, that the Type I error rate also shrinks with sample size. However, increasing sample size does nothing to the p-value when the null is true. When there is no effect, p-values come from a uniform distribution: a p-value less than .05 is just as likely as a p-value greater than .95!

As we increase our statistical power, the likelihood of Type II error (failing to notice a present effect) approaches zero. However, Type I error remains constant at whatever we set it to, no matter how many observations we collect. (You could, of course, trade power for a reduction in Type I error by setting a more stringent cutoff for "significant" p-values like .01, but this is pretty rare in our field where p<.05 is good enough to publish.)
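A quick simulation makes the point. The sketch below reruns the labmates' null regression many times at three sample sizes and counts how often p < .05 turns up; the function name type1_rate and the particular sample sizes are arbitrary choices for illustration.

```r
set.seed(42)

# Share of null regressions (no true relationship between x and y) with p < .05
type1_rate <- function(n, n_sims = 2000) {
  p <- replicate(n_sims, {
    dat <- data.frame(x = 1:n, y = rnorm(n))
    summary(lm(y ~ x, data = dat))$coefficients["x", "Pr(>|t|)"]
  })
  mean(p < .05)
}

sample_sizes <- c(20, 200, 2000)
rates <- sapply(sample_sizes, type1_rate)
data.frame(n = sample_sizes, type1_rate = rates)
```

All three rates should hover around .05, give or take simulation noise. A hundred times the subjects buys no reduction in false alarms.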

Because we don't realize that p is uniformly distributed when the null is true, we overinterpret all our p-values that are less than about .15. We've all had the experience of looking at our data and being taunted by a p-value of 0.11. "It's so low! It's tantalizingly close to marginal significance already. There must be something there, or else it would have a really meaningless p-value like p=.26. I just need to run a few more subjects, or throw out the outlier that's ruining it," we say to ourselves. "This isn't p-hacking -- my effect is really there, and I just need to reveal it."

We say hopelessly optimistic things like "p = .08 is approaching significance." The p-value is doing no such thing -- it is .08 for this data and analysis, and it is not moving anywhere. Of course, if you are in the habit of peeking at the data and adding subjects until you reach p < .05, it certainly could be "approaching" significance, but that says more about the flaws of your approach to research than the validity of your observed effects.

How about effect size? Effect size, unlike p, benefits from increasing sample size whether there's an effect or not. As the sample grows, estimates of true effects approach their real value, and estimates of null effects approach zero. Of course, after a certain point the benefits of even more samples start to decrease: going from n=200 to n=400 yields a bigger benefit to precision than does going from n=1000 to n=1200.

Let's see what effect size estimates of type I errors look like at small and large N.

Here's a Type I error at n=20. Notice that the slope is pretty steep. Here we estimate the effect size to be a whopping |r| = .44! Armed with only a p-value and this point estimate, a naive reader might be inclined to believe that the effect is indeed huge, while a slightly skeptical reader might round down to about |r| = .20. They'd both be wrong, however, since the true effect size is zero. Random numbers are often more variable than we think!

Let's try that again. Here's a Type I error at n = 10,000. Even though the p-value is statistically significant (here, p = .02), the effect size is pathetically small: |r| = .02. This is one of the many benefits of reporting the effect size and confidence interval. When the null is true, significance testing will be wrong 5% of the time at any sample size, while effect size estimates keep improving as the sample grows.
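The two examples follow from a simple relationship: for a correlation, the smallest |r| that can reach p < .05 shrinks as n grows. A short R sketch, using the standard conversion between t and r (the function name critical_r is just for illustration):

```r
# Smallest |r| that reaches two-tailed significance at a given sample size,
# from t = r * sqrt(df) / sqrt(1 - r^2) solved for r
critical_r <- function(n, alpha = .05) {
  df <- n - 2
  t_crit <- qt(1 - alpha / 2, df)
  t_crit / sqrt(t_crit^2 + df)
}

n <- c(20, 200, 10000)
data.frame(n = n, smallest_significant_r = round(critical_r(n), 4))
```

At n = 20 the threshold is about .44, so every Type I error at that sample size has to look like a large effect. At n = 10,000 it is about .02.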

This is how we got the silly story about the decline effect (http://www.newyorker.com/reporting/2010/12/13/101213fa_fact_lehrer), in which scientific discoveries tend to "wear off" over time. Suppose you find a Type I error in your n=20 study. Now you go to replicate it, and since you have faith in your effect, you don't mind running additional subjects and re-analyzing until you find p < .05. This is p-hacking, but let's presume you don't care. Chances are it will take you more than 20 subjects before you "find" your Type I error again, because it's unlikely that you would be so lucky as to find the same Type I error within the first 20 subjects. By the point that you do find p < .05, you will probably have run rather more than 20 subjects, and so the effect size estimate will be a little more precise and will fall much closer to zero. The truth doesn't "wear off." The truth always outs.
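Here is a rough R sketch of that replication-by-persistence story, under assumptions of my own choosing: start at 20 subjects, test after every added subject, and give up at 200. The function name chase_p and those limits are illustrative only.

```r
set.seed(7)

# Keep adding subjects to a null study until p < .05 or we run out of patience
chase_p <- function(start_n = 20, max_n = 200) {
  x <- rnorm(start_n)
  y <- rnorm(start_n)   # unrelated to x: the true effect is zero
  repeat {
    test <- cor.test(x, y)
    if (test$p.value < .05 || length(x) >= max_n) break
    x <- c(x, rnorm(1))   # run one more subject and peek again
    y <- c(y, rnorm(1))
  }
  c(n = length(x), r = unname(test$estimate), p = test$p.value)
}

runs <- t(replicate(200, chase_p()))
hits <- runs[runs[, "p"] < .05, , drop = FALSE]

mean(runs[, "p"] < .05)    # how often chasing p "works" with no effect at all
summary(hits[, "n"])       # sample sizes at which significance was "found"
summary(abs(hits[, "r"]))  # effect sizes reported at that point
```

Compare the reported |r| values with the .44 or so needed for significance at n = 20: the later the false positive arrives, the smaller the effect it can claim.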

Of course, effect size estimates aren't immune to p-hacking, either. One of the serious consequences of p-hacking is that it biases effect sizes.

Collect big enough samples. Look at your effect sizes and confidence intervals. Report everything you've got in the way that makes the most sense. Don't trust p. Don't chase p.

Monday, December 9, 2013

Outrageous Fortune, pt. 1

When we sit down to play a game involving dice, we understand that the results are influenced by a combination of strategy and luck. However, it's not always clear which is more important. While we'd like to think our results are chiefly the result of good strategy, and that the role of luck was fair and minimal, it's often difficult to judge. How can we make games which incorporate luck while rewarding strategy?


In order to add some excitement and variety, many game developers like to add dice rolls to their games to introduce an element of randomness. Dice rolls, the argument goes, add an element of chance that keeps the game from becoming strictly deterministic, forcing players to adapt to good and bad fortune. While some dice rolls are objectively better than others, potentially causing one player to gain the upper hand over another through luck alone, developers claim that things will "average out" in the long run, with a given player eventually experiencing just as much good luck as bad luck.

Most outcomes are near the average, with equal amounts of "good luck" (area above green line) and "bad luck" (area below red line).

Luck should average out

With the effect of luck averaging out, the player with the better strategy (e.g., the player who obtained better modifiers on their rolls) should still be able to reliably perform better.  However, players and developers alike do not often realize just how many rolls are necessary before the effect of strategy can be reliably detected as something above and beyond the effect of luck.

Forums are full of players describing which build has the better average, which is often plain to see with some math. For many players, this is all they need to concern themselves with: they have done the math and determined which build is most effective. The question for the designer, however, is whether the players can expect to see a difference within a single game or session. As it turns out, many of these modifiers are so small compared to the massive variance of a pass-fail check that it takes surprisingly long for luck to "average out".

An example: Goofus and Gallant

For the following example, I'll use Dungeons & Dragons, since that's one most gamers are likely familiar with. D&D uses a 20-sided die (1d20) to check for success or failure, and by adjusting the necessary roll, probability of success ranges from 0% to 100% by intervals of 5%. (In future posts I hope to examine other systems of checks, like those used in 2d6 or 3d6-based RPGs or wargames.)

Consider two similar level-1 characters, Goofus and Gallant. Gallant, being a smart player, has chosen the Weapon Expertise feat, giving him +1 to-hit. Goofus copied all of Gallant's choices but instead chose the Coordinated Explosion feat because he's some kind of dingus. The result is we have two identical characters, one with a to-hit modifier that is +1 better than the other. So, we expect that, in an average session, Gallant's hit rate should be 5 percentage points higher than Goofus's. But how many rolls do we need before we reliably see Gallant outperforming Goofus?

For now, let's assume a base accuracy of 50%. So, Goofus hits if he rolls an 11 or better on a 20-sided die (50% accuracy), and Gallant hits on a roll of 10 or better (55% accuracy). We'll return to this assumption later and see how it influences our results.

I used the statistical software package R to simulate the expected outcomes for sessions involving 1 to 500 rolls. For each number of rolls, I simulated 10,000 different D&D sessions. Using R for this stuff is easy and fun! Doing this lets us examine the proportion of sessions in which Gallant outperforms Goofus and vice-versa. So, how many trials are needed for Gallant to outperform Goofus?

Goofus hits on 11, Gallant hits on 10 thanks to his +1 bonus.

One intuitive guess would be that you need 20 rolls, since that 5% bonus is 1 in 20. It turns out, however, that even at 20 trials, Gallant only has a 56% probability of outperforming Goofus.

Seeing Gallant reliably (75%) outperform Goofus requires more than a hundred rolls. Even then, Goofus will still surpass him about 20% of the time. It's difficult to see the modifier make a reliable difference compared to the wild swings of fortune caused by a 50% success rate.
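For readers who want to try this themselves, here is a minimal R sketch of that kind of simulation. It is a reconstruction from the description above rather than the original script, and it only checks a handful of session lengths; the function name simulate_sessions is made up for the example.

```r
set.seed(20)

# Simulate many sessions of n_rolls attack rolls for each character and
# tally who landed more hits in each session
simulate_sessions <- function(n_rolls, p_gallant = 0.55, p_goofus = 0.50,
                              n_sessions = 10000) {
  gallant_hits <- rbinom(n_sessions, n_rolls, p_gallant)
  goofus_hits  <- rbinom(n_sessions, n_rolls, p_goofus)
  c(gallant_ahead = mean(gallant_hits > goofus_hits),
    tied          = mean(gallant_hits == goofus_hits),
    goofus_ahead  = mean(gallant_hits < goofus_hits))
}

session_lengths <- c(1, 5, 20, 100, 500)
results <- sapply(session_lengths, simulate_sessions)
colnames(results) <- session_lengths
round(results, 2)
```

Each column is a session length and each row is the share of simulated sessions with that outcome, so the 20-roll and 100-roll columns can be compared directly with the figures quoted above.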

Reducing luck through a more reliable base rate

It turns out these probabilities depend a lot on the base probability of success. When the base probability is close to 50%, combat is "swingy" -- the number of successes may be centered at 50% times the number of trials, but it's also very probable that the number of successes may be rather more or rather less than the expected value. We call this range around the expected value variance. When the base probability is closer to 0% or 100%, the variance shrinks, and the number of successes tends to hang closer to the expected value.
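The shrinking variance can be read straight off the binomial formula: the number of hits in n rolls with success probability p has variance n * p * (1 - p), which peaks at p = 50%. A tiny R sketch (the function name hit_variance is just for illustration):

```r
# Variance of the number of hits in n_rolls pass-fail checks
hit_variance <- function(p, n_rolls = 1) n_rolls * p * (1 - p)

base_rates <- c(0.50, 0.70, 0.85, 0.95)
data.frame(
  base_rate       = base_rates,
  variance_1_roll = hit_variance(base_rates),
  sd_20_rolls     = round(sqrt(hit_variance(base_rates, 20)), 2)
)
```

The last column is the typical swing, in hits, around the expected value over a 20-roll session.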

This time, let's assume a base accuracy of 85%. Now, Goofus hits on 4 or better (85%), and Gallant hits on 3 or better (90%). How many trials are now necessary to see Gallant reliably outperform Goofus?

This time, things are more stable. For very small numbers of rolls, they're more likely to tie than before. More importantly, the probability of Gallant outperforming Goofus increases more rapidly than before, because successes are less variable at this probability.

Comparing these two graphs against each other, we see the advantages of a higher base rate. For sessions involving fewer than 10 rolls, it is rather less likely that Goofus will outperform Gallant -- they'll tie, if anything. For sessions involving more than 10 rolls, the difference between Goofus and Gallant also becomes more reliable when the base rate is high. Keep in mind that we haven't increased the size of the difference between Goofus and Gallant, which is still just a +1 bonus. Instead, by making a more reliable base rate, we've reduced the influence of luck somewhat. In either case, however, keep in mind that it takes at least 10 rolls before we see Gallant outperform Goofus in just half of sessions. If you're running a competitive strategy game, you'd probably want to see a more pronounced difference than that!
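The same comparison can be made without simulation by working through the binomial probabilities directly. The sketch below does that in R for both base rates; it is an exact calculation swapped in for the simulation, and the function name p_outcomes is made up for the example.

```r
# Exact probabilities that Gallant lands more hits, ties, or lands fewer hits
# than Goofus over n rolls, treating each character's hits as binomial
p_outcomes <- function(n, p_gallant, p_goofus) {
  k <- 0:n
  gallant <- dbinom(k, n, p_gallant)
  ahead <- sum(gallant * pbinom(k - 1, n, p_goofus))  # Goofus has strictly fewer hits
  tied  <- sum(gallant * dbinom(k, n, p_goofus))
  c(gallant_ahead = ahead, tied = tied, goofus_ahead = 1 - ahead - tied)
}

rolls <- c(5, 10, 20, 100)
low_base  <- sapply(rolls, p_outcomes, p_gallant = 0.55, p_goofus = 0.50)
high_base <- sapply(rolls, p_outcomes, p_gallant = 0.90, p_goofus = 0.85)
colnames(low_base) <- colnames(high_base) <- rolls

round(low_base, 2)    # base accuracy 50%
round(high_base, 2)   # base accuracy 85%
```

Reading across the two tables shows the same +1 bonus doing more work at the higher base rate: Goofus comes out ahead less often in short sessions, and Gallant pulls ahead more reliably in long ones.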

In conclusion

To sum it all up, the issue is that players and developers expect luck to "average out", but they may not realize how many rolls are needed for this to happen. It's one thing to do the math and determine which build has the better expected value; it's another to actually observe that benefit in the typical session. It's my opinion that developers should seek to make these bonuses as reliable and noticeable as possible, but your mileage may vary. This may be more important for certain games & groups than others, after all.

My advice is to center your probabilities of success closer to 100% than to 50%. When the base probability is high, combat is less variable, and it doesn't take as long for luck to average out. Thus, bonuses are more reliably noticed in the course of play, making players observe and enjoy their strategic decisions more.

Less variable checks also have the advantage of allowing players to make more involved plans, since individual actions are less likely to fail. However, when an action does fail, it is more surprising and dramatic than it would be if failure were common. Finally, reduced variability allows the party to feel agentic and decisive, rather than being buffeted about by the whims of outrageous fortune.

Another option is to reduce the variance by dividing the result into more fine-grained categories than "success" and "failure", such as "partial success". Some tabletop systems already do this, and even D&D will try to reduce the magnitude of difference between success and failure by letting a powerful ability do half-damage on a miss, making combat less variable. Obsidian Entertainment's upcoming RPG Pillars of Eternity plans to replace most "misses" with "grazing attacks" that do half-damage instead of no damage, again reducing the role of chance -- a design decision we'll examine in greater detail in next week's post.

Future directions

Next time, we'll go one step further and see how hard it can be for that +1 to-hit bonus to actually translate into an increase in damage output. To do this, I made my work PC simulate forty million attack rolls. It was fun as heck. I hope to see you then!

Tuesday, April 10, 2012

Not Just Solid Food, But Real Food

Things have been quiet around here because I've been striving to get published elsewhere!  Today I have an article on Medium Difficulty: Not Just Solid Food, But Real Food. Please read it!

Last week, John Walker over at RPS wrote an editorial asking that games grow up and try taking on more serious themes.

For goodness sake, even Jennifer Aniston movies have more to say about love than all of gaming put together, and what Jennifer Aniston movies have to say about love is, “Durrrrrrrr.” Where is our commentary? Where is our criticism? Where is our subversion? Where is the game that questions governments, challenges society, hell, asks a bloody question? Let alone issues. Good heavens, imagine a game that dealt with issues!

I found Walker's argument to be fundamentally flawed, suggesting that games are serious or worthwhile if and only if they have serious themes.  This is wrong for several reasons:  First, we already have lots of games that pretend to be about serious things but are utterly boneheaded.  Next, there are plenty of games, some of them thousands of years old, which are taken seriously and respected by all despite their lack of theme - consider Chess or Football.  Finally, since games are meant to be won, serious concepts like love or death will be reduced to things to be won or lost.

Theme doesn't make a smart game.  Smart gameplay makes smart games.

Sunday, March 25, 2012

Why do we play games? Pt 2: Self-Determination Theory

This is the second of the multi-part series reviewing psychological theories of why we enjoy playing games.  By understanding why we play games, we can make better games that fit those motives.  In the first part, we looked at flow theory and, because I'd had an extra cup of coffee that day, a case study of Tetris as an exemplar.

Everyone who's ever written a book about video game design is at least passingly familiar with Flow Theory.  However, there is another predominant theory of motivation called Self-Determination Theory (Deci & Ryan, 1978, give or take a few years), which is gaining some popularity in game studies. Self-Determination Theory (SDT) proposes that all people have three basic psychological needs. The first is autonomy, the feeling of being in control of one's own actions, as opposed to being controlled or commanded by someone else. Next is competence, the feeling of having ability, growing in skill, and being able to meet challenges. The last is relatedness, the feeling of caring for and being cared for by others. Self-Determination Theory posits that people will find an activity intrinsically motivating (that is, they will do it of their own volition) insofar as it meets these three psychological needs.

Dr. Andrew Przybylski, motivation psychology researcher
A psychologist named Andrew Przybylski has done some promising early research looking at whether games satisfy these psychological needs.  In one study, Przybylski had participants either play a critically well-reviewed game (Ocarina of Time) or a critical flop (A Bug's Life 64).  Players who played the better videogame not only reported enjoying it more, but they also reported greater feelings of autonomy, competence, and relatedness.  In another study, participants played three different videogames, all equally well-reviewed.  Participants turned out to like these games more or less depending on the extent to which they felt that these games met their needs for autonomy, competence, and relatedness.

We can conclude that, to at least some degree, people are playing games to satisfy their psychological needs.  This research raises three questions.  The first is a psychology question:  do people play games for reasons other than to satisfy needs for autonomy, competence, and relatedness?  The second is a design question:  How can we make games which best satisfy people's psychological needs?  The third is a personality psychology question:  what is it about a particular person that determines whether a certain game meets or fails to meet their needs?

Personally, I find one of these to be much more autonomy-supportive than the other.

The combination of the last two questions reflects my greatest curiosity and greatest criticisms about today's videogames.  To me, games today seem increasingly linear and simple.  As best I can understand, linear or heavily prescriptive games should stifle players' experience of autonomy.  Consider the notorious criticism of Modern Warfare 3's single-player as an "un-game."  I think that the core of this criticism is that the player feels that MW3 fails to be autonomy-supportive.  When the player wants to do something, s/he isn't allowed to: instead, the player gets pushed out of the way so that s/he doesn't end up interfering with the next scripted event. The player is not free to explore or make decisions for him- or herself - instead, s/he spends a fair portion of the game behind a "follow" prompt, moving through the cinematic setpieces in the way the developer wants.  In the infamous "No Russian" mission from MW2, the player is forced to participate in a massacre, and the game ends abruptly if the player attempts to do anything but follow orders.  At times, one isn't so much a "player" of Modern Warfare's single-player campaign as one is a member of its audience.

Similarly, while I enjoyed Mass Effect 2 for its competent (although simple) cover-shooter combat, vivid alien species, and pleasant fashions by UnderArmour, I never really felt drawn into the general hoopla about the story.  I enjoyed the game well enough, but I never felt like I was really making my own story.  None of the choices I made really amounted to anything.  No matter what happened, I was destined to always go to the same places, shoot the same guys, and at the end, I'd have an opportunity to say something nice or say something mean.  The non-player characters would react more or less appropriately to what I said, but nothing carried forward into the future, except for which of my co-workers I was plotting to doink.  Sometimes I would make an important-sounding decision - do I brainwash an entire species or let them have their liberty? But the consequences were put off until the sequel, where I doubt they were ever addressed meaningfully.  Either way, I didn't feel like I was particularly autonomous or effective within the story - however, it was exactly this illusion of control that seemed to appeal to millions.  
It felt like the only really important decision I made was whom to romance (Miranda, duh).

On the whole, however, people seem to be pretty fond of Modern Warfare's single-player and other heavily-scripted games like it.  Mass Effect, for its part, is one of the most popular new series around.  What is different between me and the die-hard fans?  Maybe I'm just a cranky old coot who has played too many games and knows too well when I'm being railroaded.  If this is the case, we might expect that the more different games a person plays, or the more time they spend thinking about games like some crusty old nerd, the more "game-literate" they are. Highly "game-literate" players might be less convinced by the illusion of autonomy (or in MW single-player's case, the outright denial of autonomy) and receive poorer need satisfaction.  Alternatively, maybe having played certain exemplars which provided exceptional autonomy (maybe something like Fallout or X-COM or Dungeon Crawl) raises one's expectations and turns somebody into a bit of a snob.  This idea isn't too far out either - I remember seeing a similar idea advanced in a recent psychology symposium on nostalgia and experience, in which the lecturer presented data suggesting that people tend to pooh-pooh experiences in comparison to their best previous experience (i.e., after having dined on authentic fresh sushi in Tokyo, the stuff at the local supermarket doesn't cut it anymore).  Finally, it's also possible that we enjoy on-rails "experiences" like Modern Warfare for reasons not covered by Self-Determination Theory.  Maybe it's simply exciting or spectacular, literally being a spectacle, and we find that to be fun or motivating.

I also feel like most big-budget single-player games are not very good at providing opportunities for the player to exercise competence.  When gameplay mechanics are simple, or challenges too easy, there's no thrill in victory.  The player cannot feel triumphant or skilled for winning, because victory was handed to him/her on a plate.  Many games also do not seem to have much meat to their mechanics and dynamics.  Rather than being "easy to learn, difficult to master," these games are "easy to learn, easy to master."  By the end of the first stage or two, the player already knows everything necessary to skate to the end of the game.  After that, it's just a matter of slogging towards the end, usually more to see the conclusion of the story than to test yourself as a player.

This is why I often feel frustrated at the recent emphasis on stories in videogames.  Developers seem to give much more thought and publicity to the paper-thin layer of theme on top of their game, rather than the mechanics and challenge of the game itself.  It seems that every week there's a new video or interview with Ken Levine talking about Bioshock Infinite's characters or political themes or graphical design.  By comparison, we know very little about the mechanics of the gameplay, other than that there will be roller-coaster "Skylines" (a dubious-looking mechanism, given that the player seems to be a sitting duck on these rails).  Blockbuster after blockbuster, it's the same old run-and-gun, just with a new story sprinkled on top.

However, I'm probably in the minority here.  It's possible that stories might be able to provide feelings of competence for some, as players could be experiencing competence vicariously as they role-play a strong character like Bioshock Infinite's Booker DeWitt or Mass Effect's Shepard.  Challenge might not be necessary either.  Many players seem to get feelings of competence just from shooting something or watching those RPG numbers go up, regardless of challenge. Competence is theoretically driven by "setting goals and meeting them," but it's possible that those goals don't have to be particularly challenging to be rewarding.  Maybe it's that challenge makes for bigger variations in feelings of competence, experiencing higher highs and lower lows as we struggle between triumph and frustration.  If this is the case, maybe easy games are blandly comfortable, like a sitcom in its twelfth season, something marginally interesting and relaxing but unlikely to provide a peak experience.  This will appeal to many, but the experienced and daring will want something more challenging.

Yeah, yeah, you're a big man behind that turret, aren't you?

I've done a lot of thinking about what might cause some people to experience feelings of competence while others might not.  The most obvious predictor should be player skill.  When skill is matched perfectly to difficulty, competence is experienced.  When skill is too low for the difficulty, frustration ensues; when skill is too high relative to the difficulty, the player becomes bored.  It's also possible that feelings of immersion or of actually being the game's hero might cause players to feel more competent - sort of an effect of power fantasy, as we make-believe that we are the powerful hero.  I also think that some people are more afraid of losing than others.  One day, I'd like to do a study to see if there are reliable personality differences in whether people feel like losing is fun or not.  As a fan of competitive games, sports, and roguelikes, I'm very comfortable with losing as often as, if not more often than, I win.  Every defeat teaches us a little something about how we can improve as players.  However, I've seen enough tantrums from players and read enough developer postmortems about frustrated playtesters to know that not everyone is like me.

If there's one thing that games are doing well, it's providing for relatedness.  Online games are more robust and popular than ever.  Never has it been easier to play in pairs and in larger groups, both with and against each other.  Developers should keep in mind that it should be easy for players to find and play with their particular friends, however - it's frustrating when you can only play with strangers, or when a game with online multiplayer fails to allow for local multiplayer. The proliferation of internet discussion forums also seems important - games are more fun when you and your friends can talk about them together!  Relatedness motives might be part of why we seem to have no self-control about buying games.  We need to buy the game a week before it comes out so that we can play it the second it's released, for fear that we'll miss out on discussion or find that the multiplayer community has moved on.
Are people still afraid that Farmville is the new game to beat?  Maybe it's easier to provide feelings of autonomy, competence, and relatedness than we thought.

In summary, Self-Determination Theory suggests that people play games to satisfy their psychological needs for autonomy, competence, and relatedness, and that players will enjoy a particular game to the extent that it provides for those needs.  However, people probably differ a great deal in whether a given game will ultimately suit their needs.

There are still many things about games which we enjoy that may not necessarily relate to the satisfaction of Self-Determination Theory needs.  For example, many players seem to enjoy games for their stories, but I'm not so sure how that might provide feelings of need satisfaction.  There also are plenty of wildly popular "no-challenge" games like Farmville, which wouldn't seem to provide feelings of competence or relatedness (spamming your friends with requests for energy isn't exactly quality time together).  I have at least one more theory to write about in a future article which may address some of those things. In the meantime, I have to keep thinking about Przybylski's study and wonder:  why does the same game meet some players' needs and fail to meet other players' needs?

Next time, we'll talk about what developers can do to make games which provide for Self-Determination Theory needs.