Mark My Words 3.0

"[Machinery] will never be a substitute for the face of a man, with his soul in it, encouraging another man to be brave and true." — Charles Dickens




A rather unsurprising statistic from my Mendeley data. Measuring the association between the two ordinal variables (times read and times starred) with Goodman and Kruskal's gamma (γ) gives 0.676. Gamma runs from -1 to +1, so this indicates a positive association, or concordance, between articles being read and being starred: the more an item has been read, the more it gets starred. Figures!
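For anyone curious how gamma is worked out, here is a minimal sketch in Python. It counts concordant and discordant pairs and ignores ties, which is the usual definition of Goodman and Kruskal's gamma. The `reads` and `stars` lists are made-up toy values, not figures from the Mendeley data, and the function name is my own.

```python
from itertools import combinations


def goodman_kruskal_gamma(xs, ys):
    """Return gamma = (C - D) / (C + D) for two paired ordinal sequences."""
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1      # both variables move in the same direction
        elif product < 0:
            discordant += 1      # the variables move in opposite directions
        # pairs tied on either variable are ignored
    if concordant + discordant == 0:
        raise ValueError("gamma is undefined when every pair is tied")
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical toy data: times read and times starred for five papers
reads = [1, 2, 3, 4, 5]
stars = [0, 1, 0, 2, 1]
gamma = goodman_kruskal_gamma(reads, stars)
print(gamma)
```

With real data, `xs` would be the read counts and `ys` the star counts for the same papers, in the same order.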

With my Mendeley data I was able to map the number of times papers have been starred (and read) against popular network metrics such as degree, closeness centrality, and eigenvector centrality. The graph above shows the association between degree and whether a paper has been starred. Neither distribution is normal, and there appears to be little association for papers with a degree below eighty. For higher-degree papers the association appears more positive. Once I know how, I’ll try to provide some statistics.

A rather intriguing network filtered by papers starred by 10 or more users. Vertex size is relative to degree and edge opacity to edge weight. I wonder what paper the small dot near the top right is?

Using my Mendeley data, I have graphed the papers (n=159) using three common network metrics: the x-axis is set to the degree (the number of unique edges connected to a vertex); the y-axis is set to the closeness centrality (how “close” each vertex is to all other vertices); and vertex size and opacity are set to the eigenvector centrality (which reflects how many connections a vertex has and also the degree of the vertices it is connected to). As can be seen from the graph, the relationship between degree and closeness is roughly linear, and four papers stand out as highly ranked. If only I knew which ones they were…
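To make the three metrics a little more concrete, here is a rough sketch of how each could be computed by hand in Python. This is not how NodeXL does it internally (its normalisation may well differ); the graph is a hypothetical four-paper toy, and the function names are mine. Closeness uses one common definition (reachable vertices divided by total shortest-path distance), and eigenvector centrality is approximated by power iteration.

```python
from collections import deque


def degree(graph):
    """Number of unique neighbours of each vertex."""
    return {v: len(neighbours) for v, neighbours in graph.items()}


def closeness(graph):
    """(reachable vertices) / (sum of shortest-path distances), via BFS."""
    scores = {}
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


def eigenvector(graph, iterations=100):
    """Power iteration on (A + I), scaled so the largest score is 1.0.

    Adding the vertex's own score keeps the iteration from oscillating on
    bipartite graphs without changing the eigenvector it converges to.
    """
    scores = {v: 1.0 for v in graph}
    for _ in range(iterations):
        new = {v: scores[v] + sum(scores[w] for w in graph[v]) for v in graph}
        norm = max(new.values()) or 1.0
        scores = {v: s / norm for v, s in new.items()}
    return scores


# Hypothetical undirected toy network: A is linked to everything, D only to A
toy = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
deg = degree(toy)
clo = closeness(toy)
eig = eigenvector(toy)
for paper in sorted(toy):
    print(paper, deg[paper], round(clo[paper], 3), round(eig[paper], 3))
```

Plotting `deg` against `clo`, with point size taken from `eig`, gives the same kind of chart as the one described above.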

Hacking Mendeley Data

Mendeley, a tool designed to help you share, organise and discover research papers, has entered the DataTEL Challenge and released a huge data set of 4.8M records. One of the data files contains the libraries of 50,000 Mendeley users. I split this file into 250,000-line chunks using the Mac Terminal command:

split -l 250000 libraries

I was then able to create a script to process each line of the (several) files and insert them into a PostgreSQL database. I had to do this in chunks, as I kept hitting memory and timeout issues due to the number of lines! Finally, after a good few hours, I had the records in a database.

My next step was to find all of the papers that had been stored in more than one user’s library. After all, if only one user has a paper in their library, it’s not worth doing any further calculation on! The result was a subset of 596,739 records, which I exported to a CSV file that could be opened in Microsoft Excel.

From my reading, I found that my edge list depicted a multimodal (here, two-mode) affiliation network, a combination of users and papers, which I could convert to a unimodal network using a PivotTable and SUMPRODUCT(). An example can be found here under the Serious Eats Affiliation Matrix Example. I saved the new data set as an open matrix, which could then be imported into NodeXL.

I now had 159 papers as vertices, and a total of 3,490 edges showing relationships between papers that had been saved by the same users. For example, if User X saved both Paper A and Paper B, an edge is created between the two papers.
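The filter-and-convert step can also be sketched in a few lines of Python instead of Excel. This is only an illustration under my own assumptions: the input is a list of (user, paper) pairs, the function name and `min_users` parameter are invented for the example, and the edge weight is simply the number of users who hold both papers, which is the same count the PivotTable and SUMPRODUCT() approach produces for each pair of papers.

```python
from collections import defaultdict
from itertools import combinations


def project_to_papers(user_paper_rows, min_users=2):
    """Collapse (user, paper) rows into weighted paper-to-paper edges."""
    papers_by_user = defaultdict(set)
    users_by_paper = defaultdict(set)
    for user, paper in user_paper_rows:
        papers_by_user[user].add(paper)
        users_by_paper[paper].add(user)

    # Keep only papers stored in at least `min_users` libraries
    kept = {p for p, users in users_by_paper.items() if len(users) >= min_users}

    # Every pair of kept papers in one user's library adds 1 to that edge
    edge_weights = defaultdict(int)
    for papers in papers_by_user.values():
        for a, b in combinations(sorted(papers & kept), 2):
            edge_weights[(a, b)] += 1
    return dict(edge_weights)


# Hypothetical toy libraries: three users, four papers
rows = [
    ("X", "A"), ("X", "B"),
    ("Y", "A"), ("Y", "B"), ("Y", "C"),
    ("Z", "C"), ("Z", "D"),
]
edges = project_to_papers(rows)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

In the toy data, Paper D sits in only one library, so it is dropped before any edges are built, just as single-library papers were excluded above.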

I’ve managed to come up with two networks so far, albeit randomly. I was just pleased to have my data.

Following on from my last Mendeley post, here’s the paper-to-paper network showing three clusters. I wonder if these are arts, science, and both? Or maybe it’s not that at all! Each edge represents a relationship: at least one user has both Paper A and Paper B in their library. Again, I will post my process soon so that you can check if I’m doing this “right”.

Today, I’ve been working with big data: data made available by Mendeley, who are entering the DataTEL Challenge. Working with 4.8M records is a huge undertaking, and I’m definitely learning a lot about how to handle such a vast expanse of data. Anyway, I thought I’d share a paper-to-paper unimodal network showing 159 of the most stored papers from a selection of 50,000 Mendeley users. I will discuss my process another time, but wanted to share my initial network diagram with you. Obviously there’s lots more work to be done, but I’m happy with my progress so far!

Also available in interactive format.
