I realised it’s been over a month since I started my secondment in Graz, so I should get a blog post done for the TEAM blog. My first several weeks here have been fun so far, with initially rather British-style weather giving way to constant sunshine and temperatures over 30 degrees C! I’ve explored Graz (see photos here and here) and discovered new beers (Murauer being my favourite so far), and everyone has made me feel very welcome. I’ve even learned a small amount of German, though mainly for things like ordering food or saying “ich spreche ein kleines bisschen Deutsch” (“I speak a little bit of German”)…
I’m working on deduplication of bibliographic records, with the aim of this work feeding back to Mendeley for an improved version of the catalogue. Specifically, I’m developing ground truth datasets that can be used to train machine learning models to deduplicate, as well as tools to analyse the clusters produced by the deduplication system, e.g. to enable analysis of the mistakes the system makes. I will also finish off work started at Mendeley on improving the efficiency of the deduplication system.
So far the main achievements I’ve made in this work are:
- Generating a deduplication dataset consisting of 400K documents, with 500K duplicate pairings and 500K non-duplicate pairings, derived from the arXiv data we imported into Mendeley and user-added documents carrying arXiv identifiers. The code will later be adapted to generate another ground truth dataset derived from the PubMed data we’ve imported and user documents carrying PubMed IDs.
- Obtaining the distribution of similarities between user-added documents carrying arXiv or PubMed identifiers and the imported metadata from these sources in Mendeley’s catalogue. These distributions suggest that when user documents carry these identifiers, the metadata is usually very similar to the imported metadata; this is particularly true for the PubMed data.
- Creating tools for analysing clustering results, such as evaluating how well documents carrying arXiv or PubMed identifiers have been clustered (assuming the arXiv or PubMed identifier is correct, or at least correct when the metadata validates against the imported metadata from arXiv/PubMed), and computing inter-cluster similarities and cluster purities.
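The pair-generation idea behind the ground truth dataset can be sketched roughly as below. This is a minimal illustration, not the actual Mendeley code: the record structure and field names (`arxiv_id`, `title`) are assumptions, and the real pipeline works at a much larger scale. Records sharing an arXiv identifier are treated as duplicates; records with different identifiers as non-duplicates.

```python
import itertools
import random

def build_pairs(records, n_pairs):
    """Derive duplicate and non-duplicate pairings from records
    carrying (assumed-correct) arXiv identifiers."""
    # Group records by their arXiv ID.
    by_id = {}
    for rec in records:
        by_id.setdefault(rec["arxiv_id"], []).append(rec)

    # Duplicate pairs: every pairing within a group sharing an ID.
    duplicates = [
        pair
        for group in by_id.values()
        for pair in itertools.combinations(group, 2)
    ]

    # Non-duplicate pairs: sample records from two different ID groups.
    ids = list(by_id)
    non_duplicates = []
    while len(non_duplicates) < n_pairs:
        a, b = random.sample(ids, 2)
        non_duplicates.append(
            (random.choice(by_id[a]), random.choice(by_id[b]))
        )

    return duplicates[:n_pairs], non_duplicates
```

In practice the non-duplicate sampling would want to favour “hard” negatives (similar-looking but distinct papers) rather than purely random ones, so the learner sees informative examples.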
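The similarity distributions mentioned above could be computed along these lines — again a sketch under assumptions, using a simple edit-based title similarity where the real system would compare several metadata fields with its own similarity measures:

```python
from difflib import SequenceMatcher

def title_similarity(user_doc, imported_doc):
    """Normalised string similarity (0..1) between two titles."""
    a = user_doc["title"].lower().strip()
    b = imported_doc["title"].lower().strip()
    return SequenceMatcher(None, a, b).ratio()

def similarity_distribution(pairs, bins=10):
    """Histogram of similarities over (user_doc, imported_doc) pairs,
    e.g. pairs matched via a shared arXiv or PubMed identifier."""
    counts = [0] * bins
    for user_doc, imported_doc in pairs:
        s = title_similarity(user_doc, imported_doc)
        counts[min(int(s * bins), bins - 1)] += 1
    return counts
```

A distribution heavily weighted towards the top bins would support the observation that user-added metadata usually matches the imported metadata closely.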
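Cluster purity, one of the evaluation measures above, has a standard definition: for each cluster take the count of members sharing the majority label (here, the arXiv or PubMed identifier), sum over clusters, and divide by the total number of documents. A minimal sketch, representing each cluster as a list of identifier labels:

```python
from collections import Counter

def cluster_purity(clusters):
    """Weighted purity of a clustering: fraction of documents that
    carry their cluster's majority label. 1.0 means every cluster
    is homogeneous with respect to the identifiers."""
    total = sum(len(cluster) for cluster in clusters)
    majority = sum(
        Counter(cluster).most_common(1)[0][1] for cluster in clusters
    )
    return majority / total
```

Purity alone rewards over-splitting (singleton clusters are perfectly pure), which is why it's paired here with inter-cluster similarity analyses.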
I’ll be moving on to performing experiments using the arXiv dataset, whilst constructing a PubMed dataset to work with as well. I also intend to swap data with my fellow TEAM project member Ago, who is working on deduplication in a different field.