Original title: “Nerdy Statistics”, 6.4.2014
Greetings, everyone. I am a long-time reader, and almost-first-time poster. The official Battlelog forums make me simultaneously sad and angry; it’s always nice to come here and be whelmed by how everyone is so reasonable.
Some members of this community put forth quite a lot of effort digging into the a priori aspects of the game: the hard numbers straight from the game files on things like weapon performance. I’m afraid I’m not much help on that front, but I can, perhaps, make some “hard” a posteriori observations. Through the Battlelog stats API, I have gathered data on about 25,000 PC players (see below for the methodology); I think this sample is large enough that estimates drawn from it are fairly precise, as the narrow confidence intervals below suggest. Here are a few examples of stuff I have gleaned:
The total k / d ratio of the entire game is around 1.34. That is total kills divided by total deaths.
The average k / d ratio of individual soldiers is 1.299 ± 0.0237 (99% confidence interval). That is the sum of all the individual soldiers’ k / d ratios divided by the total number of soldiers.
However, when dealing with ratios, it’s generally better to transform them logarithmically; the mean of the natural log of k / d is 0.1391 ± 0.0077, which translates back to a k / d of 1.1492 (this is the geometric mean of the individual ratios), with a 99% confidence interval of (1.1403, 1.1581).
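For anyone who wants to reproduce these three kinds of figures on their own data, here is a minimal sketch in Python. The function names are mine and purely illustrative. It assumes a normal-approximation interval (z = 2.576 for 99%), takes "log" to be the natural log (which is consistent with 0.1391 mapping back to 1.1492), and skips soldiers with zero kills or zero deaths in the per-soldier figures; that last choice is my assumption, not necessarily how the numbers above were produced.

```python
import math
import statistics

Z99 = 2.576  # two-sided 99% critical value under a normal approximation (assumption)


def mean_ci(values, z=Z99):
    """Return (mean, half_width) for a normal-approximation confidence interval."""
    m = statistics.fmean(values)
    half = z * statistics.stdev(values) / math.sqrt(len(values))
    return m, half


def kd_summary(kills, deaths):
    """Summarize k/d three ways from parallel lists of per-soldier kills and deaths."""
    # Pooled ratio: total kills over total deaths across every soldier.
    pooled = sum(kills) / sum(deaths)

    # Per-soldier ratios; zero kills or deaths are skipped because the ratio
    # or its log would be undefined (an assumption of this sketch).
    ratios = [k / d for k, d in zip(kills, deaths) if k > 0 and d > 0]

    mean_kd, half_kd = mean_ci(ratios)
    mean_log, half_log = mean_ci([math.log(r) for r in ratios])

    return {
        "pooled": pooled,
        "mean": (mean_kd, half_kd),
        "mean_log": (mean_log, half_log),
        # Exponentiating the mean log gives the geometric mean of the ratios.
        "geometric_mean": math.exp(mean_log),
        "geometric_ci": (math.exp(mean_log - half_log),
                         math.exp(mean_log + half_log)),
    }
```

Note that the back-transformed interval is not symmetric around the geometric mean, because the symmetric interval lives on the log scale.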
Here are a couple of graphics I made with some of the available data, too:
This one is from a simple random sample of 2,500 players (about 10% of the total data; all of the data on one scatter plot like this would be ridiculous):
The correlation value here is -0.03, so basically zero. Maybe all those “sissy snipers” aren’t actually doing their k / d ratios any favors. The more relevant plot would be log( k / d ) vs (fraction of) time using a sniper rifle, but that involves a slightly more, well, involved SQL query; I’ll save that one for another time.
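For reference, the sample-then-correlate step can be sketched roughly like this in Python. This is an illustration rather than the code behind the plot: pearson_r and sample_pairs are names I made up, and the (x, y) pairs stand for whichever two per-soldier quantities go on the axes.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def sample_pairs(pairs, k, seed=None):
    """Simple random sample of k items, drawn without replacement."""
    rng = random.Random(seed)
    return rng.sample(pairs, k)


if __name__ == "__main__":
    # Toy data only: 25,000 unrelated (x, y) pairs, thinned to 2,500 for plotting.
    gen = random.Random(0)
    pairs = [(gen.random(), gen.random()) for _ in range(25000)]
    subset = sample_pairs(pairs, 2500, seed=1)
    xs, ys = zip(*subset)
    print(round(pearson_r(xs, ys), 3))
```

A value near zero only says there is no linear relationship between the two quantities; it doesn't rule out something more complicated.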
So, if there are any statistical questions you guys have that I might be able to answer with this gigantic chunk of data, this is the place to ask. I’d love to attack them!
ADDENDUM: Data Collection Methodology
In order to query Battlelog’s stats API, an individual soldier’s “soldier ID” is required. To collect a bunch of IDs, I wrote three scripts: one to scrape the official Battlelog forums for account names (I think just the BF4 General Discussions and Battlefield 4 - PC subforums, as I was specifically seeking data about PC players), another to scrape those accounts’ profile pages for BF4 PC soldier IDs, and a third to do a gazillion queries of the stats API. I ended up with data on just over 25,000 PC players in a 370ish MB SQLite database. The collection happened before the release of Naval Strike, but close enough to it that the data does include some stats on the NS weapons, vehicles, and equipment. I will probably run the collection script again in the near future; I have it running on a Raspberry Pi, and it’ll probably take between one and two weeks to run all the queries and update its database (which gets slower as the SQLite file gets larger).
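To give a rough idea of the shape of the last stage (soldier IDs in, SQLite rows out), here is a heavily simplified Python sketch. Everything specific in it is hypothetical: the table layout, the column names, and especially fetch_stats, which stands in for the actual stats API query and is passed in as a plain function. It is not the real collection script.

```python
import sqlite3
import time


def open_db(path):
    """Open (or create) the database with a hypothetical one-table schema."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS soldiers ("
        " soldier_id TEXT PRIMARY KEY,"
        " kills INTEGER,"
        " deaths INTEGER)"
    )
    return conn


def collect(conn, soldier_ids, fetch_stats, delay=0.0):
    """Fetch stats for each soldier ID and upsert them into the database.

    fetch_stats is a hypothetical callable taking a soldier ID and returning
    a dict with 'kills' and 'deaths', or None if nothing was found.
    Returns the number of rows written (re-fetched IDs count each time).
    """
    written = 0
    for sid in soldier_ids:
        stats = fetch_stats(sid)
        if stats is None:
            continue
        # INSERT OR REPLACE keeps one row per soldier when the script is re-run.
        conn.execute(
            "INSERT OR REPLACE INTO soldiers VALUES (?, ?, ?)",
            (sid, stats["kills"], stats["deaths"]),
        )
        written += 1
        time.sleep(delay)  # optional pause between queries
    conn.commit()
    return written
```

Keying the table on the soldier ID means a second collection run updates existing rows instead of duplicating them.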
So, admittedly, this isn’t a perfect random sample of players; it only includes players who have posted on the forums. I’m not sure what kind of bias this might induce in my data. Is it bias towards good players, because the ones most interested in the game are the ones who are going to post on the forums? Is it bias towards lousy players, because they are the ones spending all their time whining on the forums instead of getting better? In any case, the bias is probably much deeper and more complicated than just good/bad (what do “good” and “bad” even mean in this context?). I honestly don’t know.