diff --git a/recognition/45249435/README.md b/recognition/45249435/README.md
new file mode 100644
index 0000000000..82b3260310
--- /dev/null
+++ b/recognition/45249435/README.md
@@ -0,0 +1,62 @@
+# Multilayer GCN
+GCNs are graph convolutional networks; the name simply reflects the type of data they take as input. A CNN usually operates on a 2D image: a filter
+moves sequentially across the image, computing an operation on each patch it sees. But now assume that your data is a graph where nodes are connected to each other
+by edges. CNNs do not work in this case because a graph does not live in a Euclidean space, so there is no regular grid for a filter to slide over. In short,
+GCNs are the CNN analogue for data represented as a graph. An example of such a dataset (and the dataset I will
+train a GCN on) is the [Facebook Large Page-Page Network dataset](https://snap.stanford.edu/data/facebook-large-page-page-network.html). I will classify the nodes of
+this dataset: given the **128 features** of each webpage, I will predict which of the four classes it belongs to:
+* tvshow
+* government
+* company
+* politician
+
+Given that this problem is semi-supervised, the dataset needs to be split accordingly: the training and validation sets need to be
+significantly smaller than the testing set.
+
+
+## How it works
+1. The program imports and reads the data, given as numpy arrays, which are then converted to pandas dataframes
+2. The data is split into training, validation and testing sets; because the task is semi-supervised, the training and validation sets only have 500 points each
+3. A graph data structure is created from the data
+4. One-hot encoding is used to convert the categorical target variable to a numerical representation
+5. 
The model is then initialised with arbitrary hyperparameters at first, but it is later tuned for better accuracy
+6. The model is then trained and evaluated
+7. The evaluation consists of a graph showing the validation and training accuracy and loss for each epoch
+8. Using the testing data, the model tries to predict the class of each page
+9. The node embeddings are then reduced to 2 dimensions and plotted using t-SNE
+
+## High level explanation of the algorithm
+Each node has a set of features describing it. Each node sends a message to each one of its neighbours containing all the features it has. The
+features gathered from the neighbours are combined using a linear operation (e.g. an average). The result is then put through a standard neural network layer, and the output becomes
+the new state of the node. This is done for every node in the graph.
+
+## Train and validation graphs
+![train and validation accuracy](https://raw.githubusercontent.com/Pentaflouride/PatternFlow/topic-recognition/recognition/45249435/train_val%20accuracy.png)
+
+## TSNE embedded graph
+![TSNE](https://raw.githubusercontent.com/Pentaflouride/PatternFlow/topic-recognition/recognition/45249435/tsne.png)
+
+## Training the model
+![training of the model](https://raw.githubusercontent.com/Pentaflouride/PatternFlow/topic-recognition/recognition/45249435/Training_stage.png)
+
+## Testing the model
+![testing the model](https://raw.githubusercontent.com/Pentaflouride/PatternFlow/topic-recognition/recognition/45249435/testing%20model.png)
+
+## Other outputs
+Other outputs, such as the shapes of the data and how I progressed in solving the problem, are given in the notebook. model.py is a refined version of the notebook wrapped in a
+function. The driver.py file runs the model.py file and gives the outputs shown above (i.e. it does not show any of the less important outputs like data shapes).
+
+## Usage
+Run the driver.py file to get all the main outputs shown above. The driver.py file does not require any arguments.
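The layer-wise message passing described in the high-level explanation above can be sketched with plain NumPy. This is purely an illustrative toy example, not part of the repository: the graph, features and weights are made up, and stellargraph's real GCN layer additionally applies the symmetric degree normalisation shown here.

```python
import numpy as np

# Toy undirected graph on 4 nodes with edges 0-1, 1-2, 2-3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

# Add self-loops so each node also keeps its own features
A_hat = A + np.eye(4)

# Symmetric normalisation: D^-1/2 (A + I) D^-1/2
d = A_hat.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt

# Random node features (4 nodes x 3 features) and a weight matrix (3 -> 2)
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))
W = rng.normal(size=(3, 2))

# One GCN layer: average neighbour messages, transform, apply ReLU.
# Each row of H is the new 2-dimensional state of a node.
H = np.maximum(A_norm @ X @ W, 0)
print(H.shape)  # (4, 2)
```

Stacking two such layers (as the `layer_sizes=[32, 32]` model does) lets information from two-hop neighbours reach each node.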
The driver file has a main function, so it will
+run as soon as driver.py is run.
+
+## Driver Dependencies
+* Tensorflow
+* Sklearn
+* Keras
+* Pandas
+* Matplotlib
+* Stellargraph
+
+## Extra Notebook dependencies
+* Scipy
diff --git a/recognition/45249435/Training_stage.png b/recognition/45249435/Training_stage.png
new file mode 100644
index 0000000000..b5f815fe39
Binary files /dev/null and b/recognition/45249435/Training_stage.png differ
diff --git a/recognition/45249435/driver.py b/recognition/45249435/driver.py
new file mode 100644
index 0000000000..746aa7e9ef
--- /dev/null
+++ b/recognition/45249435/driver.py
@@ -0,0 +1,18 @@
+import model
+
+def main():
+    """Running this file will start training the GCN model on the facebook dataset.
+    After training is done it will evaluate the validation accuracy.
+    After the validation accuracy is evaluated it will graph the validation
+    accuracy along with its loss, as well as the training accuracy and loss.
+    Lastly it will give a t-SNE plot of the dataset and how it was classified, using
+    specific colours. If there are more than 4 colours, there was an error.
+    Also note that the t-SNE plot might not match the colours given in the
+    README file, since the model changes each time it is run and the colour
+    choices are random.
+ """ + model.run_model() + + +if __name__=="__main__": + main() \ No newline at end of file diff --git a/recognition/45249435/model.py b/recognition/45249435/model.py new file mode 100644 index 0000000000..dc4edfbc58 --- /dev/null +++ b/recognition/45249435/model.py @@ -0,0 +1,107 @@ +# Model.py by Paul Turculetu (GCN algortihm) +# Feel free to use any of this code for any of your needs +# November 2021 +# Final report +# Training a GCN for the facebook dataset and producing a tsne + +import pandas as pd +import numpy as np +import stellargraph as sg +from sklearn.model_selection import train_test_split +from sklearn import preprocessing as pre +from tensorflow.keras import layers, optimizers, losses, Model +from stellargraph.mapper import FullBatchNodeGenerator +from stellargraph.layer import GCN +from sklearn.manifold import TSNE +import matplotlib.pyplot as plt + +def run_model(): + """loads and preprocess data, then it trains, evaluates and graphs a TSNE + on the classification of nodes. It also graphs the error on the evaluation + and training dataset + """ + + # load the numpy arrays of the data given in the question + # also find out how many classes the target variable has + np_edges = np.load("edges.npy") + np_features = np.load("features.npy") + np_target = np.load("target.npy") + + # store the data as dataframes also, give the columns proper names + # so things don't become confusion. Make data into a graph with edges and nodes + df_features = pd.DataFrame(np_features) + df_edges = pd.DataFrame(np_edges) + df_targets = pd.DataFrame(np_target) + df_edges.columns = ["source", "target"] + df_targets.columns = ["target"] + mat = sg.StellarGraph(df_features, df_edges) + + # split the data into train, test and validation keeping in my that + # the train and validation sets need to be significantly smaller than + # the testing set. 
+    train_data, test_data = train_test_split(df_targets, train_size=500)
+    val_data, test_data = train_test_split(test_data, train_size=500)
+
+    # one-hot encode the target datasets because right now each class is
+    # represented as a raw label
+    one_hot_target = pre.LabelBinarizer()
+    train_targets = one_hot_target.fit_transform(train_data['target'])
+    val_targets = one_hot_target.transform(val_data['target'])
+    test_targets = one_hot_target.transform(test_data['target'])
+
+    # initialize the model, tuning the hyperparameters to get
+    # better results
+    generator = FullBatchNodeGenerator(mat, method="gcn")
+    train_gen = generator.flow(train_data.index, train_targets)
+    gcn = GCN(
+        layer_sizes=[32, 32], activations=["relu", "relu"], generator=generator, dropout=0.2
+    )
+    x_in, x_out = gcn.in_out_tensors()
+    pred = layers.Dense(units=train_targets.shape[1], activation="softmax")(x_out)
+
+    # compile the model using the adam optimizer
+    model = Model(inputs=x_in, outputs=pred)
+    model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
+                  loss=losses.categorical_crossentropy,
+                  metrics=["acc"],
+                  )
+    val_gen = generator.flow(val_data.index, val_targets)
+
+    # train the model
+    result = model.fit(
+        train_gen,
+        epochs=100,
+        validation_data=val_gen,
+        verbose=2,
+        shuffle=False
+    )
+
+    # show an accuracy graph
+    sg.utils.plot_history(result)
+
+    # test the model on the testing data
+    test_gen = generator.flow(test_data.index, test_targets)
+    print("testing data accuracy given below: ")
+    model.evaluate(test_gen)
+
+    # set up the t-SNE by getting embeddings for the full dataset
+    all_nodes = df_targets.index
+    all_gen = generator.flow(all_nodes)
+
+    embedding_model = Model(inputs=x_in, outputs=x_out)
+    emb = embedding_model.predict(all_gen)
+    X = emb.squeeze(0)
+
+    # reduce the embeddings to 2 dimensions.
+    tsne = TSNE(n_components=2)
+    X_2 = tsne.fit_transform(X)
+
+    # draw the t-SNE plot
+    fig, ax = plt.subplots(figsize=(10, 10))
+    ax.scatter(X_2[:, 0], X_2[:, 1], c=df_targets.squeeze(), cmap='turbo',
+               alpha=0.5)
+    ax.set(
+        title="t-SNE visualization of GCN embeddings for the facebook dataset"
+    )
+    plt.show()
diff --git a/recognition/45249435/testing model.png b/recognition/45249435/testing model.png
new file mode 100644
index 0000000000..8cfd42a56e
Binary files /dev/null and b/recognition/45249435/testing model.png differ
diff --git a/recognition/45249435/train_val accuracy.png b/recognition/45249435/train_val accuracy.png
new file mode 100644
index 0000000000..bb9a82ad7f
Binary files /dev/null and b/recognition/45249435/train_val accuracy.png differ
diff --git a/recognition/45249435/tsne.png b/recognition/45249435/tsne.png
new file mode 100644
index 0000000000..ddc6e98f72
Binary files /dev/null and b/recognition/45249435/tsne.png differ
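
As a standalone illustration of the semi-supervised split-and-encode stage used in model.py above, the sketch below reproduces it with synthetic labels. Everything here is an assumption for demonstration: the real `target.npy` is replaced by 2000 random page labels, so only the split sizes and the encoding mechanics match the real pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import preprocessing as pre

# Hypothetical stand-in for target.npy: 2000 random page labels, 4 classes
rng = np.random.default_rng(0)
labels = rng.choice(["tvshow", "government", "company", "politician"], size=2000)
df_targets = pd.DataFrame({"target": labels})

# Semi-supervised split: 500 train, 500 validation, the remainder for testing
train_data, test_data = train_test_split(df_targets, train_size=500)
val_data, test_data = train_test_split(test_data, train_size=500)

# One-hot encode the string labels into a 4-column indicator matrix
one_hot = pre.LabelBinarizer()
train_targets = one_hot.fit_transform(train_data["target"])
print(train_targets.shape)  # (500, 4)
```

Note that the binarizer is fitted only on the training labels and reused (via `transform`) for the validation and test sets, so the column order stays consistent across splits.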