Mendeley, a tool designed to help you share, organise and discover research papers, has entered the DataTEL Challenge and released a huge data set of 4.8M records. One of the data files contains the libraries of 50,000 Mendeley users. I split this file into 250,000-line chunks using the Mac Terminal command:
split -l 250000 libraries
I then wrote a script to process each line of the (several) resulting files and insert it into a PostgreSQL database. I had to do this in chunks, as I kept hitting memory and timeout issues due to the number of lines! Finally, after a good few hours, I had the records in a database.
My next step was to find all of the papers that had been stored in more than one user’s library. After all, if only one user has a paper in their library, it’s not worth doing any further calculation on! The result was a subset of 596,739 records, which I exported to a CSV file that could be opened in Microsoft Excel.
From my reading, I found that my edge list depicted a multimodal affiliation network (two modes: users and papers), which I could convert to a unimodal network using a PivotTable and SUMPRODUCT(). An example can be found here under the Serious Eats Affiliation Matrix Example. I was then able to save the new data set as an open matrix, which could be imported into NodeXL.
I now had 159 papers as vertices, and a total of 3,490 edges showing relationships between papers that had been saved by the same users. For example, if Paper A was saved by User X and User X had also saved Paper B, a relationship would be created between Paper A and Paper B.
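For anyone who would rather not do the filtering and conversion steps in a database and Excel, the same idea can be sketched in a few lines of Python. This is only an illustration and not the script or spreadsheet described above. The function names (`shared_papers`, `project_to_papers`) and the toy user and paper IDs are made up, and it assumes the edge list is simply a sequence of (user, paper) pairs. The edge weight is the number of users who saved both papers, which is the kind of count the SUMPRODUCT() approach produces from an affiliation matrix.

```python
from collections import defaultdict
from itertools import combinations


def shared_papers(edges):
    """Group users by paper and keep only papers saved by more than one user."""
    users_by_paper = defaultdict(set)
    for user, paper in edges:
        users_by_paper[paper].add(user)
    return {p: users for p, users in users_by_paper.items() if len(users) > 1}


def project_to_papers(users_by_paper):
    """Convert the user-paper network into a paper-paper network.

    Returns {(paper_a, paper_b): number_of_users_who_saved_both}.
    """
    # Invert the mapping so we can walk each user's library.
    papers_by_user = defaultdict(set)
    for paper, users in users_by_paper.items():
        for user in users:
            papers_by_user[user].add(paper)

    # Every pair of papers in the same library gains one shared user.
    weights = defaultdict(int)
    for papers in papers_by_user.values():
        for a, b in combinations(sorted(papers), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Toy edge list, invented for illustration (not Mendeley data).
edges = [
    ("user_x", "paper_a"),
    ("user_x", "paper_b"),
    ("user_y", "paper_a"),
    ("user_y", "paper_b"),
    ("user_y", "paper_c"),
    ("user_z", "paper_c"),
    ("user_z", "paper_d"),  # only one user has paper_d, so it is dropped
]

kept = shared_papers(edges)
paper_edges = project_to_papers(kept)
for (a, b), weight in sorted(paper_edges.items()):
    print(a, b, weight)
```

In the toy data, paper_a and paper_b end up connected with a weight of 2, because both user_x and user_y saved them. With the real data, the pairwise step can grow quickly for users with very large libraries, so treat this as a starting point rather than something tuned for 596,739 records.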
I’ve managed to come up with two networks so far, albeit randomly. I was just pleased to have my data.