# Scaling Science '19

# September 2019

# August 2019

Published on Nov 26, 2019


In September, we focused on building a new dataset to help predict emerging technologies. In our original dataset, which we built from papers and authors in the top 40 or so journals, we found the emergence of machine learning and AI, but not CRISPR. We believe this is because CRISPR technology was not initially published in these top journals. In other words, we missed CRISPR not because our algorithm was inadequate, but because the data was insufficient. To address this, we found another 40 or so biology journals where CRISPR initially appeared and created an extended version of our original database with the information from these new papers.

After a round of model training with our previously hand-engineered features, we visualized the importance of the different variables used in our process and used this visualization to debug some of the metrics computed on the graph. We then rewrote these metrics and redeployed them on the new, extended dataset.

We also implemented the method from the paper *Watch Your Step: Learning Node Embeddings via Graph Attention* (https://arxiv.org/abs/1710.09599) in order to get better node2vec-style embeddings. This paper cleverly optimizes one of the node2vec hyperparameters, context size, by learning it jointly with the embeddings while training on a downstream task. To create the training data, we used the method from the paper *Learning Edge Representations via Low-Rank Asymmetric Projections* (https://arxiv.org/abs/1705.05615), which amounts to taking the citation data and semi-randomly removing edges to create train and test datasets. The method itself uses an attention mechanism to control the distribution over the context, rather than sampling uniformly as in node2vec. The intuition is that different graphs store information at different levels of depth and breadth away from a node. For example, in a sparsely connected graph, the nodes that a node is directly connected to might be most important in determining the structure of the graph. In densely connected graphs, the degree of the node or the role that the node serves in the network (bridge vs. hub, for example) can be more important. We believe this should lead us to embeddings that better capture the structure of our graph. In the future, we would like to implement embedding methods that can capture heterogeneous graphs with different node types, rather than only the homogeneous citation graphs we consider now.
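To make the attention idea concrete, here is a minimal NumPy sketch. It is illustrative only: the function names and the toy graph are assumptions, and it simplifies the paper's formulation by leaving out the embedding training and the per-node walk counts. It shows only how a softmax over learnable logits, one per walk step, weights the powers of the random-walk transition matrix to form the expected context.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    x = np.asarray(x, dtype=float)
    e = np.exp(x - x.max())
    return e / e.sum()

def expected_context(adj, q_logits):
    """Attention-weighted context: sum_k Q_k * T^k with Q = softmax(q_logits).

    adj      : (n, n) adjacency matrix of the graph
    q_logits : one logit per walk step k = 1..C (hypothetical parameter name);
               in the real method these are learned jointly with the embeddings
    """
    adj = np.asarray(adj, dtype=float)
    deg = adj.sum(axis=1, keepdims=True)
    # Row-normalize to get the transition matrix T (isolated nodes keep a zero row)
    T = np.divide(adj, deg, out=np.zeros_like(adj), where=deg > 0)
    Q = softmax(q_logits)
    power = np.eye(adj.shape[0])
    total = np.zeros_like(adj)
    for weight in Q:
        power = power @ T          # T^k for k = 1, 2, ..., C
        total += weight * power
    return total

# Toy path graph 0 - 1 - 2
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]])
near = expected_context(adj, [10.0, -10.0])  # attention mostly on direct neighbors
even = expected_context(adj, [0.0, 0.0])     # equal weight on 1-step and 2-step neighbors
```

With large logits on the first step the context is dominated by direct neighbors, which matches the sparse-graph intuition above; spreading the weight over later steps lets more distant structure contribute.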

I spent the month of August training a UROP and implementing the calculation of citation-based metrics on our graph database of scholarly publication metadata. The goal is to create models that quantify and predict the impact of papers, authors, and journals. To do this, we are creating hand-engineered features based on the paper *Learning to Predict Citation-Based Impact Measures* by Luca Weihs and Oren Etzioni (http://ai2-website.s3.amazonaws.com/publications/JCDL2017.pdf).

As an example of the types of features we calculated, we counted the number of coauthors for each author and calculated the average number of citations they receive per year. To capture recent changes in the graph, which can be a useful signal in prediction, we also calculated quantities such as the change in an author’s h-index and in their number of publications over the past two years. The full set of metrics is in the paper linked above.
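A rough sketch of a few of these author features in plain Python follows. The `Paper` record, the field names, and the toy data are all hypothetical stand-ins for the graph database, and the exact feature definitions in the paper may differ.

```python
from dataclasses import dataclass

@dataclass
class Paper:
    year: int               # publication year
    authors: tuple          # author identifiers
    citation_years: tuple   # one entry per incoming citation: the year it was made

def h_index(citation_counts):
    """Largest h such that at least h papers have at least h citations each."""
    h = 0
    for rank, count in enumerate(sorted(citation_counts, reverse=True), start=1):
        if count < rank:
            break
        h = rank
    return h

def h_index_as_of(papers, author, year):
    """h-index using only papers and citations that existed by the given year."""
    counts = [sum(1 for y in p.citation_years if y <= year)
              for p in papers if author in p.authors and p.year <= year]
    return h_index(counts)

def author_features(papers, author, current_year):
    mine = [p for p in papers if author in p.authors and p.year <= current_year]
    coauthors = {a for p in mine for a in p.authors} - {author}
    total_citations = sum(
        sum(1 for y in p.citation_years if y <= current_year) for p in mine)
    career_years = (current_year - min(p.year for p in mine) + 1) if mine else 0
    return {
        "num_coauthors": len(coauthors),
        "citations_per_year": total_citations / career_years if career_years else 0.0,
        "h_index_delta_2y": (h_index_as_of(papers, author, current_year)
                             - h_index_as_of(papers, author, current_year - 2)),
        "papers_delta_2y": sum(1 for p in mine if p.year > current_year - 2),
    }

papers = [
    Paper(2015, ("alice", "bob"), (2016, 2017, 2018)),
    Paper(2017, ("alice", "carol"), (2018, 2019)),
    Paper(2019, ("alice",), (2019,)),
]
features = author_features(papers, "alice", 2019)
```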

One hard part was finding ways to calculate these metrics efficiently. One solution we ended up using was to calculate the more straightforward metrics from the structure of the graph, and then to save these metrics back to the graph to be used in the calculation of further metrics. For example, to calculate the average number of citations per year an author received, we calculated the total number of citations the author had received and the total number of years they had been publishing, saved these as properties in the graph, and then simply divided. This approach was an order of magnitude faster than calculating the metrics from scratch using the graph structure, and we used it for numerous other metrics as well. The risk is that if one of the initial metrics is calculated improperly, the error propagates to every metric derived from it. To mitigate this risk, we had to be confident the base metrics were correct, which was a challenge in and of itself.
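The two-pass pattern can be sketched as follows. A plain dictionary stands in for node properties in the graph database, and the property names and record layout are assumptions made for illustration.

```python
from collections import defaultdict

def store_base_metrics(author_props, papers):
    """Pass 1: walk the graph structure once and save simple aggregates as properties."""
    totals = defaultdict(int)
    first_year, last_year = {}, {}
    for paper in papers:
        for author in paper["authors"]:
            totals[author] += paper["citations"]
            first_year[author] = min(first_year.get(author, paper["year"]), paper["year"])
            last_year[author] = max(last_year.get(author, paper["year"]), paper["year"])
    for author, total in totals.items():
        props = author_props.setdefault(author, {})
        props["total_citations"] = total
        props["years_publishing"] = last_year[author] - first_year[author] + 1

def store_derived_metrics(author_props):
    """Pass 2: derive further metrics from stored properties only, with no traversal."""
    for props in author_props.values():
        if "total_citations" in props and "years_publishing" in props:
            props["citations_per_year"] = (
                props["total_citations"] / props["years_publishing"])

papers = [
    {"authors": ["alice", "bob"], "year": 2010, "citations": 10},
    {"authors": ["alice"], "year": 2014, "citations": 5},
]
author_props = {}
store_base_metrics(author_props, papers)
store_derived_metrics(author_props)
```

The second pass never touches the papers, which is where the speedup comes from. It is also why a mistake in the first pass silently carries through to everything derived from it.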

Verifying that the metrics were correct was hard because our dataset is not equivalent to the public data behind services such as Google Scholar. Thus, when we calculated the total number of citations or the h-index for an author, we couldn’t simply cross-check the results against what was publicly available. One way we remedied this was to calculate metrics by hand for entities in the early years of the graph, where the data is small enough to count manually, and to verify that the computed results matched the hand-calculated ones. This gave us confidence that the functions generating the results were indeed accurate.
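A check of this kind might look like the sketch below. The paper identifiers, the year cutoff, and the hand-counted values are made up for illustration; the point is only the shape of the comparison on a small early slice of the graph.

```python
def total_citations_by_author(papers, citations, up_to_year):
    """Citations received per author, restricted to papers published by up_to_year."""
    year_of = {p["id"]: p["year"] for p in papers}
    authors_of = {p["id"]: p["authors"] for p in papers}
    totals = {}
    for citing, cited in citations:
        if year_of[citing] <= up_to_year and year_of[cited] <= up_to_year:
            for author in authors_of[cited]:
                totals[author] = totals.get(author, 0) + 1
    return totals

def mismatches(computed, expected):
    """List of (author, expected, computed) disagreements; empty means the check passed."""
    return [(author, want, computed.get(author, 0))
            for author, want in expected.items()
            if computed.get(author, 0) != want]

papers = [
    {"id": "p1", "year": 1990, "authors": ["alice"]},
    {"id": "p2", "year": 1991, "authors": ["bob"]},
    {"id": "p3", "year": 1992, "authors": ["alice", "carol"]},
    {"id": "p4", "year": 2005, "authors": ["dave"]},
]
citations = [("p2", "p1"), ("p3", "p1"), ("p3", "p2"), ("p4", "p3")]  # (citing, cited)

# Values counted by hand for the slice of the graph up to 1992
hand_counted = {"alice": 2, "bob": 1, "carol": 0}
problems = mismatches(total_citations_by_author(papers, citations, 1992), hand_counted)
```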

We wish to see how well the hand-engineered features from metadata can predict the impact of a paper as measured through some PageRank function, which uses either the citation graph or the coauthor graph to assign every paper or author a ranking. Eventually, we plan on testing the predictive performance of these features against node2vec, which finds low-dimensional representations of the graph using random walks, much as PageRank is defined in terms of a random walk. Our hope is that we can achieve state-of-the-art prediction with some combination of representation learning and hand-engineered features.
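For reference, a standard PageRank power iteration on a citation graph can be sketched as below. This is a generic textbook version, not necessarily the PageRank variant we will end up using; the damping factor, the tolerance, and the treatment of papers that cite nothing are assumptions.

```python
import numpy as np

def pagerank(edges, num_nodes, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank.

    edges : (citing, cited) index pairs; rank flows from the citing paper
            to the cited paper.
    """
    out_degree = np.zeros(num_nodes)
    for src, _ in edges:
        out_degree[src] += 1
    rank = np.full(num_nodes, 1.0 / num_nodes)
    for _ in range(max_iter):
        # Teleportation term shared by every node
        new_rank = np.full(num_nodes, (1.0 - damping) / num_nodes)
        # Papers that cite nothing (dangling nodes) spread their rank uniformly
        new_rank += damping * rank[out_degree == 0].sum() / num_nodes
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        if np.abs(new_rank - rank).sum() < tol:
            return new_rank
        rank = new_rank
    return rank

# Toy citation graph: paper 1 cites 0; paper 2 cites 0 and 1
scores = pagerank([(1, 0), (2, 0), (2, 1)], num_nodes=3)
```

In this toy graph the most-cited paper (0) receives the highest score and the uncited paper (2) the lowest.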