Merged
Commits
41 commits
e86701c
first commit, testing if the forks and everything worked
Pentaflouride Oct 25, 2021
bc67acd
Added the readme and datasets
Pentaflouride Oct 25, 2021
c208729
added the numpy version of the dataset and split it up too
Pentaflouride Oct 26, 2021
39f0247
Working version of the GCN. Accuracy is too low
Pentaflouride Oct 26, 2021
cba5050
improving accuracy by changing hyper parameters
Pentaflouride Oct 26, 2021
2f03f18
Evaluating the testing data
Pentaflouride Oct 26, 2021
deaa0ca
Added TSNE
Pentaflouride Oct 26, 2021
541b12f
The updated TSNE was not added in the last commit
Pentaflouride Oct 26, 2021
d4091a0
added comments and cleaned up a bit of the code
Pentaflouride Oct 26, 2021
e87616c
changed the jupyter notebook to a py file
Pentaflouride Oct 26, 2021
66f7c67
took out some print statements out of the model file. Also slightly c…
Pentaflouride Oct 26, 2021
fd43e3b
added the driver.py file
Pentaflouride Oct 26, 2021
8b144fb
Testing readme
Pentaflouride Oct 26, 2021
18c93af
Finished introduction
Pentaflouride Oct 26, 2021
e398f94
Added details about the problem
Pentaflouride Oct 26, 2021
4fd3981
Update README.md
Pentaflouride Oct 26, 2021
ce9182e
newlines
Pentaflouride Oct 26, 2021
367365f
Update README.md
Pentaflouride Oct 26, 2021
70e6c57
how it works added
Pentaflouride Oct 27, 2021
7757824
Update README.md
Pentaflouride Oct 27, 2021
e25fd22
formatting
Pentaflouride Oct 27, 2021
e60894b
train and val accuracy image
Pentaflouride Oct 27, 2021
ea3dda0
added the images of the graphs
Pentaflouride Oct 27, 2021
46b9e75
Merge branch 'topic-recognition' of https://github.com/Pentaflouride/…
Pentaflouride Oct 27, 2021
8fb6cfc
Update README.md
Pentaflouride Oct 27, 2021
3fce77f
image hopefully works now
Pentaflouride Oct 27, 2021
551691e
TSNE image added
Pentaflouride Oct 27, 2021
cbdd671
traning and testing proof
Pentaflouride Oct 27, 2021
4e1c96f
Merge branch 'topic-recognition' of https://github.com/Pentaflouride/…
Pentaflouride Oct 27, 2021
e9f8858
training, testing and usage added
Pentaflouride Oct 27, 2021
25b1633
Added dependencies
Pentaflouride Oct 27, 2021
2bb7cbf
Finished README
Pentaflouride Oct 27, 2021
cf09286
Changed some headers
Pentaflouride Oct 27, 2021
d406825
Proof read
Pentaflouride Oct 27, 2021
f1a9df2
Deleted, the notebook file.
Pentaflouride Nov 20, 2021
dd71288
Delete data files
Pentaflouride Nov 23, 2021
434aee3
Delete data files
Pentaflouride Nov 23, 2021
b5877da
Delete data files
Pentaflouride Nov 23, 2021
feefbd6
Delete data files
Pentaflouride Nov 23, 2021
063141f
Added block description
Pentaflouride Nov 23, 2021
646ef94
delete
Pentaflouride Nov 23, 2021
62 changes: 62 additions & 0 deletions recognition/45249435/README.md
@@ -0,0 +1,62 @@
# Multilayer GCN
GCN stands for graph convolutional network; the name simply refers to the type of data the network works on. A CNN usually takes a 2D image and slides a filter
over it, applying the same operation to each patch it sees. Now assume that your data is a graph, where nodes are connected to each other via edges. CNNs do not
work in this case because a graph is not embedded in a regular Euclidean grid, so there is no fixed neighbourhood for a filter to slide over. In short, GCNs are
the CNN analogue for data represented as a graph. An example of such a dataset (and the dataset I build a GCN for) is the
[Facebook Large Page-Page Network dataset](https://snap.stanford.edu/data/facebook-large-page-page-network.html). I classify the nodes of
this dataset: given the **128 features** of each webpage, I predict which of the four classes it belongs to:
* tvshow
* government
* company
* politician

Given that this problem is meant to be semi-supervised, the dataset needs to be split accordingly: the training and validation sets need to be
significantly smaller than the testing set.
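
As a minimal sketch of such a split (variable and file names are assumed to mirror model.py further down; 500 is the size actually used there, leaving the vast majority of nodes for testing):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# df_targets holds one class label per node, as in model.py
df_targets = pd.DataFrame(np.load("target.npy"), columns=["target"])

# 500 nodes for training, 500 for validation, everything else for testing
train_data, rest = train_test_split(df_targets, train_size=500)
val_data, test_data = train_test_split(rest, train_size=500)
```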


## How it works
1. The program imports the data, given as numpy arrays, and converts it to pandas dataframes (steps 1, 3 and 4 are sketched after this list)
2. The data is split into training, validation and testing sets; because the problem is semi-supervised, the training and validation sets each contain only 500 points
3. A graph-like data structure is created from the data
4. One-hot encoding converts the categorical target variable into a numerical one
5. The model is initialised with arbitrary hyperparameters at first and tuned later for better accuracy
6. The model is then trained and evaluated
7. The evaluation consists of a graph showing the validation and training accuracy and loss for each epoch
8. Using the testing data, the model tries to predict the page category
9. The learned embeddings are reduced to 2 dimensions and plotted using t-SNE
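
A condensed sketch of steps 1, 3 and 4 (the split from step 2 is sketched in the previous section; file and column names follow model.py below):

```python
import numpy as np
import pandas as pd
import stellargraph as sg
from sklearn import preprocessing as pre

# step 1: numpy arrays -> pandas dataframes
df_features = pd.DataFrame(np.load("features.npy"))
df_edges = pd.DataFrame(np.load("edges.npy"), columns=["source", "target"])
df_targets = pd.DataFrame(np.load("target.npy"), columns=["target"])

# step 3: build the graph structure; nodes carry the 128 features, edges link pages
graph = sg.StellarGraph(df_features, df_edges)

# step 4: one-hot encode the four page categories
encoder = pre.LabelBinarizer()
onehot_targets = encoder.fit_transform(df_targets["target"])
```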

## High level explanation of the algorithm
Each node has a set of features describing it and a set of neighbouring nodes. Each node sends a message to each of its neighbours containing all of its features.
The features received from the neighbours are combined using a simple linear operation (e.g. an average). The result is then passed through a standard neural
network layer, and the output becomes the new state of the node. This is done for every node in the graph.
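
As a toy illustration of one such propagation step (plain numpy with mean aggregation; this is a simplified sketch for intuition, not the StellarGraph layer used in model.py, and the matrix names are assumptions):

```python
import numpy as np

def gcn_step(A, H, W):
    """One propagation step: average the neighbours' features (including the
    node's own, via a self-loop), then apply a learned linear map and a ReLU."""
    A_hat = A + np.eye(A.shape[0])                      # add self-loops so a node keeps its own features
    A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalise: mean over each neighbourhood
    return np.maximum(A_norm @ H @ W, 0)                # linear transform followed by ReLU
```

Here `A` would be the page-page adjacency matrix, `H` the 128-dimensional feature matrix and `W` a trainable weight matrix; stacking two such steps loosely corresponds to the two 32-unit GCN layers used in model.py.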

## Train and validation graphs
![train and validation accuracy](https://raw.githubusercontent.com/Pentaflouride/PatternFlow/topic-recognition/recognition/45249435/train_val%20accuracy.png)

## TSNE embedded graph
![TSNE](https://raw.githubusercontent.com/Pentaflouride/PatternFlow/topic-recognition/recognition/45249435/tsne.png)

## Training the model
![training of the model](https://raw.githubusercontent.com/Pentaflouride/PatternFlow/topic-recognition/recognition/45249435/Training_stage.png)

## Testing the model
![testing the model](https://raw.githubusercontent.com/Pentaflouride/PatternFlow/topic-recognition/recognition/45249435/testing%20model.png)

## Other outputs
Other outputs, like the shapes of the data and how I progressed in solving the problem, are given in the notebook. model.py is a refined version of the notebook wrapped in a
function. The driver.py file runs model.py and gives the outputs shown above (i.e. it does not show less important outputs such as data shapes).

## Usage
Run driver.py to get all the main outputs shown above. It does not require any arguments. The driver file has a main function, so the model runs as soon as
driver.py is executed.
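
For example, from the recognition/45249435 directory (model.py loads edges.npy, features.npy and target.npy by relative path, so they are assumed to be in the working directory):

```
python driver.py
```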

## Driver Dependencies
* Tensorflow
* Sklearn
* Keras
* Pandas
* Matplotlib
* Stellargraph

## Extra Notebook dependencies
* Scipy
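
One possible way to install these (PyPI package names; Sklearn installs as scikit-learn, and Keras ships with TensorFlow here since the code imports tensorflow.keras; exact versions are not pinned in this repository):

```
pip install tensorflow scikit-learn pandas matplotlib stellargraph scipy
```
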
Binary file added recognition/45249435/Training_stage.png
18 changes: 18 additions & 0 deletions recognition/45249435/driver.py
@@ -0,0 +1,18 @@
import model

def main():
""" running this file with start training the GCN model for the facebook dataset.
After training is done it will evaluate the validation accuracy.
After the validation accuracy is evaluated it will graph the validation
accuracy along with its loss and also the training accuracy and loss.
Lastly it will give a TSNE plot of the dataset and how it was evaluated given
specific colours. If there are more than 4 colours there was an error.
Also note that the TSNE plot might not match the colours given in the
README file since the models changes each time it is run and the colour
choices are random.
"""
model.run_model()


if __name__=="__main__":
main()
107 changes: 107 additions & 0 deletions recognition/45249435/model.py
@@ -0,0 +1,107 @@
# Model.py by Paul Turculetu (GCN algorithm)
# Feel free to use any of this code for any of your needs
# November 2021
# Final report
# Training a GCN on the Facebook dataset and producing a t-SNE plot

import pandas as pd
import numpy as np
import stellargraph as sg
from sklearn.model_selection import train_test_split
from sklearn import preprocessing as pre
from tensorflow.keras import layers, optimizers, losses, Model
from stellargraph.mapper import FullBatchNodeGenerator
from stellargraph.layer import GCN
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def run_model():
"""loads and preprocess data, then it trains, evaluates and graphs a TSNE
on the classification of nodes. It also graphs the error on the evaluation
and training dataset
"""

# load the numpy arrays of the data provided for the task
np_edges = np.load("edges.npy")
np_features = np.load("features.npy")
np_target = np.load("target.npy")

# store the data as dataframes and give the columns proper names
# so things don't become confusing, then build a graph from those edges and nodes
df_features = pd.DataFrame(np_features)
df_edges = pd.DataFrame(np_edges)
df_targets = pd.DataFrame(np_target)
df_edges.columns = ["source", "target"]
df_targets.columns = ["target"]
mat = sg.StellarGraph(df_features, df_edges)

# split the data into train, test and validation, keeping in mind that
# the train and validation sets need to be significantly smaller than
# the testing set.
train_data, test_data = train_test_split(df_targets, train_size=500)
val_data, test_data = train_test_split(test_data, train_size=500)

# one-hot encode the target datasets because right now each class is
# represented by a string
one_hot_target = pre.LabelBinarizer()
train_targets = one_hot_target.fit_transform(train_data['target'])
val_targets = one_hot_target.transform(val_data['target'])
test_targets = one_hot_target.transform(test_data['target'])

# initialize the model changing the hyper parameters to get
# better results
generator = FullBatchNodeGenerator(mat, method="gcn")
train_gen = generator.flow(train_data.index, train_targets)
gcn = GCN(
layer_sizes=[32, 32], activations=["relu", "relu"], generator=generator, dropout=0.2
)
x_in, x_out = gcn.in_out_tensors()
pred = layers.Dense(units=train_targets.shape[1], activation="softmax")(x_out)

# optimize the model using the adam optimizer
model = Model(inputs=x_in, outputs=pred)
model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
loss=losses.categorical_crossentropy,
metrics=["acc"],
)
val_gen = generator.flow(val_data.index, val_targets)


# train the model
result = model.fit(
train_gen,
epochs=100,
validation_data=val_gen,
verbose=2,
shuffle=False
)

# show an accuracy graph
sg.utils.plot_history(result)

# Test the model on the testing data
test_gen = generator.flow(test_data.index, test_targets)
print("testing data accuracy given below: ")
model.evaluate(test_gen)

# set up the t-SNE by getting embeddings for the full dataset
all_nodes = df_targets.index
all_gen = generator.flow(all_nodes)

embedding_model = Model(inputs=x_in, outputs=x_out)
emb = embedding_model.predict(all_gen)
X = emb.squeeze(0)

# turn the data into 2 dimensions.
tsne = TSNE(n_components=2)
X_2 = tsne.fit_transform(X)

# draw the t-SNE plot, colouring each node by its class
fig, ax = plt.subplots(figsize=(10, 10))
ax.scatter(X_2[:, 0],X_2[:, 1],c=df_targets.squeeze(),cmap='turbo',
alpha=0.5)
ax.set(
title="TSNE visualization of GCN embeddings for facebook dataset"
)

# display all figures (needed when running as a plain script rather than in a notebook)
plt.show()

Binary file added recognition/45249435/testing model.png
Binary file added recognition/45249435/train_val accuracy.png
Binary file added recognition/45249435/tsne.png