Introduction

I use GitHub a lot for schoolwork and personal projects. It is a wonderful platform for deploying, publishing, and collaborating on coding projects. Because of how nifty it is, I am excited to explore this graph of GitHub project membership. I am not sure how the data was preprocessed. Judging by the interactive network data visualization and analytics platform, all the nodes appear to be connected. The graph has only 121,709 nodes compared to GitHub's over 100 million total projects, so it is likely the largest connected component or a subgraph of the largest connected component. The full graph was taking way too long to process, so I ended up taking a random sample of 40,000 nodes and extracting the largest connected component, which had 19,498 nodes. Even with this subgraph, the whole notebook took about seven hours to run.

This graph is bipartite, with one collection of nodes representing members and the other representing projects. I think member nodes with high centrality values will likely be senior engineers at larger companies or significant figures in the field. Project nodes with high centrality values will likely be highly used or integral projects. I do not think the degree distribution will exhibit a power law because highly involved developers only have so much time. Projects can also only have so many people working on them before it becomes very hard to keep everyone up to date and efficient. I don't know if the graph will exhibit the small-world property. I think this graph might have it because it is much smaller than all of GitHub and likely contains the most important projects and people, but the total GitHub network probably would not.

Functions

Analyses

Preprocessing
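The sampling step described in the introduction (draw a random set of nodes, keep the edges among them, then keep the largest connected component) could be sketched roughly as follows. This is a minimal illustration in plain Python, not the notebook's actual code: the adjacency-dictionary representation, the function names, and the `seed` parameter are all assumptions made for the example.

```python
import random
from collections import deque


def induced_subgraph(adj, keep):
    """Keep only the nodes in `keep` and the edges between them."""
    keep = set(keep)
    return {u: adj[u] & keep for u in keep}


def largest_component(adj):
    """Return the node set of the largest connected component (BFS)."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


def sample_largest_component(adj, n_sample, seed=0):
    """Random node sample -> induced subgraph -> largest component."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    nodes = rng.sample(sorted(adj), min(n_sample, len(adj)))
    sub = induced_subgraph(adj, nodes)
    return induced_subgraph(sub, largest_component(sub))
```

With the full graph loaded as `adj` (a dictionary mapping each node to the set of its neighbors), a call like `sample_largest_component(adj, 40000)` would correspond to the sample size used here. Sampling nodes at random drops every edge that has an endpoint outside the sample, which is why the largest component ends up much smaller than the sample itself.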

Produce a visualization of the graph (or graph sample that you used)

Find the 10 nodes with the highest degree.
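A minimal sketch of how the top-degree nodes could be pulled out of an adjacency dictionary. The graph representation, the function name, and the tie-breaking rule (smaller node id first) are assumptions for illustration, and the toy graph is not the GitHub data.

```python
import heapq


def top_k_by_degree(adj, k=10):
    """Return the k (node, degree) pairs with the largest degree.

    Ties are broken by the smaller node id so the result is reproducible.
    """
    degree = {u: len(neighbors) for u, neighbors in adj.items()}
    return heapq.nsmallest(k, degree.items(), key=lambda item: (-item[1], item[0]))


# Tiny illustrative graph: node 0 is the hub.
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(top_k_by_degree(toy, k=2))  # [(0, 3), (1, 2)]
```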

Find the 10 nodes with the highest betweenness centrality.
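Betweenness is often the slowest of these measures on a graph this size, so here is a hedged sketch of Brandes' algorithm for an unweighted, undirected graph, with an optional subset of source nodes to trade accuracy for speed. It is written in plain Python over an adjacency dictionary; the function name and the `sources` parameter are assumptions for this example, and it is not necessarily how the notebook computed its values.

```python
from collections import deque


def betweenness(adj, sources=None):
    """Unnormalized betweenness for an unweighted, undirected graph.

    With sources=None every node is used as a BFS source (exact result).
    Passing a subset of nodes gives an estimate rescaled to the full graph.
    """
    sources = list(adj) if sources is None else list(sources)
    score = {u: 0.0 for u in adj}
    for s in sources:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        dist = {s: 0}
        sigma = {u: 0 for u in adj}
        sigma[s] = 1
        preds = {u: [] for u in adj}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {u: 0.0 for u in adj}
        for w in reversed(order):
            for u in preds[w]:
                delta[u] += sigma[u] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each undirected pair is seen from both ends, hence the factor of 2.
    scale = len(adj) / (2 * len(sources))
    return {u: b * scale for u, b in score.items()}
```

On a path `0-1-2` this gives node 1 a score of 1 (it lies on the only shortest path between 0 and 2) and the endpoints a score of 0. Using a few hundred random sources instead of all nodes would cut the running time roughly in proportion, at the cost of noisier rankings.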

Find the 10 nodes with the highest clustering coefficient. If there are ties, choose 10 to report and explain how the 10 were chosen.

The ten chosen were the first 10 in the sorted list of node clustering coefficients.
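Which nodes come "first" among ties depends on how the sort orders equal values, so one way to make the choice reproducible is to add an explicit tie-breaker and report how many nodes were tied. The sketch below assumes an adjacency dictionary and breaks ties by the smaller node id; both function names and the tie-breaking rule are illustrative choices, not the notebook's.

```python
from itertools import combinations


def local_clustering(adj):
    """Fraction of each node's neighbor pairs that are themselves connected."""
    result = {}
    for u, neighbors in adj.items():
        k = len(neighbors)
        if k < 2:
            result[u] = 0.0  # fewer than two neighbors: no pairs to check
            continue
        links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
        result[u] = 2 * links / (k * (k - 1))
    return result


def top_k_with_ties(scores, k=10):
    """Top k by score, ties going to the smaller node id.

    Also returns how many nodes share the k-th score, so the size of the
    tie can be reported alongside the chosen nodes.
    """
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    top = ranked[:k]
    tied = sum(1 for _, value in ranked if value == top[-1][1]) if top else 0
    return top, tied
```

For a triangle `0-1-2` with an extra node 3 attached to node 0, nodes 1 and 2 both have a coefficient of 1, so `top_k_with_ties(local_clustering(adj), k=1)` picks node 1 and reports that 2 nodes were tied.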

Find the top 10 nodes as ranked by eigenvector centrality

Find the top 10 nodes as ranked by PageRank
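For reference, PageRank on an undirected graph can be approximated with a short power iteration. The following is a sketch under stated assumptions (adjacency dictionary input, damping factor 0.85, rank of isolated nodes spread evenly); the parameter names are hypothetical and a graph library's implementation may differ in details such as normalization and stopping rule.

```python
def pagerank(adj, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank treating each undirected edge as two-way."""
    n = len(adj)
    rank = {u: 1.0 / n for u in adj}
    for _ in range(max_iter):
        # Rank held by isolated nodes is redistributed over all nodes.
        dangling = sum(rank[u] for u in adj if not adj[u])
        new = {u: (1 - damping) / n + damping * dangling / n for u in adj}
        for u, neighbors in adj.items():
            share = damping * rank[u] / len(neighbors) if neighbors else 0
            for v in neighbors:
                new[v] += share
        change = sum(abs(new[u] - rank[u]) for u in adj)
        rank = new
        if change < tol:
            break
    return rank


def top_k(scores, k=10):
    """Highest-scoring nodes first, ties broken by the smaller node id."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))[:k]
```

On a graph where every node has the same degree, such as a cycle, all nodes receive the same rank, which is a quick sanity check before running it on a real sample.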

Comment on the differences and similarities in questions 12-16. Are the highly ranked nodes mostly the same? Do you notice significant differences in the rankings? Why do you think this is the case?

All of the measures above tend to share many of the same nodes. Clustering coefficient is the only one with a significantly different top 10, but those nodes all have a clustering coefficient of 1. The top nodes in the other measures would probably also have a clustering coefficient of 1, but there might just be too many nodes tied at 1 for them to appear in the first 10. I think this shows that the people or projects who are the most integral are truly the most important in the graph. Nodes 301, 654, 866, and 8 seem to be especially important. It would be interesting to see which projects or people these nodes represent.
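One way to put a number on "many of the same nodes" is the Jaccard overlap between each pair of top-10 lists. The sketch below is illustrative only: the function names are assumptions, and the node ids in `example` are placeholders, not the notebook's actual results.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def compare_rankings(top_lists):
    """Pairwise Jaccard overlap between named top-k node lists."""
    return {
        (m1, m2): jaccard(top_lists[m1], top_lists[m2])
        for m1, m2 in combinations(sorted(top_lists), 2)
    }


# Placeholder node ids, NOT the notebook's actual results.
example = {
    "degree": [1, 2, 3, 4],
    "betweenness": [1, 2, 3, 5],
    "clustering": [6, 7, 9, 10],
}
overlaps = compare_rankings(example)
print(overlaps[("betweenness", "degree")])  # 0.6
```

A value near 1 means two measures pick almost the same nodes, while a value of 0 means the lists are disjoint.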

Compute the average shortest path length in the graph. Based on your result, does the graph exhibit small-world behavior?

This graph seems to exhibit small-world behavior: the average shortest path length is very close to the base-10 logarithm of the number of nodes. I tried to compute the omega small-world coefficient, but it ran for over 16 hours without finishing.
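As a rough illustration of the comparison, the sketch below computes the average shortest path length with breadth-first search and prints it next to the logarithm of the node count. It assumes a connected graph stored as an adjacency dictionary; the function names and the optional `n_sources` sampling parameter are hypothetical additions that trade accuracy for speed. For the 19,498-node sample, log base 10 is about 4.29 and the natural log is about 9.88; the small-world comparison is usually stated up to a constant factor, so the base of the logarithm is not essential.

```python
import math
import random
from collections import deque


def bfs_distances(adj, source):
    """Hop distance from `source` to every node reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def average_shortest_path(adj, n_sources=None, seed=0):
    """Mean distance over (source, target) pairs with source != target.

    Exact when n_sources is None; otherwise an estimate based on a random
    sample of BFS sources. Assumes the graph is connected.
    """
    nodes = sorted(adj)
    if n_sources is not None and n_sources < len(nodes):
        nodes = random.Random(seed).sample(nodes, n_sources)
    total = pairs = 0
    for s in nodes:
        dist = bfs_distances(adj, s)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs


# Path graph 0-1-2-3 as a toy input.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
avg = average_shortest_path(path)
print(avg, math.log10(len(path)), math.log(len(path)))
```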

Plot the degree distribution of the graph on a log-log-scale. Does the graph exhibit power law behavior? Include the plot and the code used to generate it in your submission.
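A minimal sketch of how the degree distribution could be tabulated and drawn on log-log axes, assuming an adjacency dictionary as input. The function names are illustrative, and matplotlib is assumed to be available only for the plotting function. The least-squares slope is a rough visual aid rather than a rigorous test for a power law.

```python
import math
from collections import Counter


def degree_distribution(adj):
    """Sorted (degree, number of nodes with that degree) pairs, degree >= 1."""
    counts = Counter(len(neighbors) for neighbors in adj.values())
    return sorted((d, c) for d, c in counts.items() if d > 0)


def loglog_slope(points):
    """Least-squares slope of log10(count) against log10(degree)."""
    xs = [math.log10(d) for d, _ in points]
    ys = [math.log10(c) for _, c in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx


def plot_degree_distribution(adj):
    """Scatter plot on log-log axes (matplotlib is imported only here)."""
    import matplotlib.pyplot as plt

    degrees, counts = zip(*degree_distribution(adj))
    plt.loglog(degrees, counts, marker="o", linestyle="none")
    plt.xlabel("Degree")
    plt.ylabel("Number of nodes")
    plt.title("Degree distribution (log-log)")
    plt.show()
```

If the points fall close to a straight line with a negative slope over a wide range of degrees, that is consistent with power-law behavior; a curve that bends downward sharply at high degree suggests a cutoff instead.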

[EC] Create a log-log plot with the logarithm of node degree on the x-axis and the logarithm of the average clustering coefficient of nodes with that degree on the y-axis. Does the clustering coefficient exhibit power law behavior (is there a clustering effect)? Include the plot and the code used to generate it in your submission.

The clustering coefficient is either 1 or 0 for this graph. I'm not sure why that is.
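For the plot itself, the per-degree averages could be computed along these lines. This is a sketch that assumes two dictionaries are already available, one mapping each node to its degree and one mapping each node to its clustering coefficient; the function names are hypothetical.

```python
from collections import defaultdict


def mean_clustering_by_degree(degree, clustering):
    """Average clustering coefficient of the nodes sharing each degree."""
    groups = defaultdict(list)
    for node, k in degree.items():
        groups[k].append(clustering[node])
    return {k: sum(vals) / len(vals) for k, vals in sorted(groups.items())}


def loglog_points(averages):
    """Drop zeros, which cannot be placed on a logarithmic axis."""
    return [(k, c) for k, c in averages.items() if k > 0 and c > 0]
```

Because a logarithmic axis cannot show zero, any degree whose nodes all have a coefficient of 0 disappears from the plot, which matters when most coefficients are exactly 0 or 1.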