Merged
Commits
41 commits
e86701c
first commit, testing if the forks and everything worked
Pentaflouride Oct 25, 2021
bc67acd
Added the readme and datasets
Pentaflouride Oct 25, 2021
c208729
added the numpy version of the dataset and split it up too
Pentaflouride Oct 26, 2021
39f0247
Working version of the GCN. Accuracy is too low
Pentaflouride Oct 26, 2021
cba5050
improving accuracy by changing hyper parameters
Pentaflouride Oct 26, 2021
2f03f18
Evaluating the testing data
Pentaflouride Oct 26, 2021
deaa0ca
Added TSNE
Pentaflouride Oct 26, 2021
541b12f
The updated TSNE was not added in the last commit
Pentaflouride Oct 26, 2021
d4091a0
added comments and cleaned up a bit of the code
Pentaflouride Oct 26, 2021
e87616c
changed the jupyter notebook to a py file
Pentaflouride Oct 26, 2021
66f7c67
took out some print statements out of the model file. Also slightly c…
Pentaflouride Oct 26, 2021
fd43e3b
added the driver.py file
Pentaflouride Oct 26, 2021
8b144fb
Testing readme
Pentaflouride Oct 26, 2021
18c93af
Finished introduction
Pentaflouride Oct 26, 2021
e398f94
Added details about the problem
Pentaflouride Oct 26, 2021
4fd3981
Update README.md
Pentaflouride Oct 26, 2021
ce9182e
newlines
Pentaflouride Oct 26, 2021
367365f
Update README.md
Pentaflouride Oct 26, 2021
70e6c57
how it works added
Pentaflouride Oct 27, 2021
7757824
Update README.md
Pentaflouride Oct 27, 2021
e25fd22
formatting
Pentaflouride Oct 27, 2021
e60894b
train and val accuracy image
Pentaflouride Oct 27, 2021
ea3dda0
added the images of the graphs
Pentaflouride Oct 27, 2021
46b9e75
Merge branch 'topic-recognition' of https://github.com/Pentaflouride/…
Pentaflouride Oct 27, 2021
8fb6cfc
Update README.md
Pentaflouride Oct 27, 2021
3fce77f
image hopefully works now
Pentaflouride Oct 27, 2021
551691e
TSNE image added
Pentaflouride Oct 27, 2021
cbdd671
traning and testing proof
Pentaflouride Oct 27, 2021
4e1c96f
Merge branch 'topic-recognition' of https://github.com/Pentaflouride/…
Pentaflouride Oct 27, 2021
e9f8858
training, testing and usage added
Pentaflouride Oct 27, 2021
25b1633
Added dependencies
Pentaflouride Oct 27, 2021
2bb7cbf
Finished README
Pentaflouride Oct 27, 2021
cf09286
Changed some headers
Pentaflouride Oct 27, 2021
d406825
Proof read
Pentaflouride Oct 27, 2021
f1a9df2
Deleted, the notebook file.
Pentaflouride Nov 20, 2021
dd71288
Delete data files
Pentaflouride Nov 23, 2021
434aee3
Delete data files
Pentaflouride Nov 23, 2021
b5877da
Delete data files
Pentaflouride Nov 23, 2021
feefbd6
Delete data files
Pentaflouride Nov 23, 2021
063141f
Added block description
Pentaflouride Nov 23, 2021
646ef94
delete
Pentaflouride Nov 23, 2021
62 changes: 62 additions & 0 deletions recognition/45249435/README.md
@@ -0,0 +1,62 @@
# Multilayer GCN
GCN stands for graph convolutional network; the name simply refers to the type of data the network works on. A CNN usually takes a 2D image and slides a filter
over it, applying the same operation to each patch it sees. Now assume that your data is a graph, where nodes are connected to each other via edges. CNNs do not
work in this case because a graph is not embedded in a regular Euclidean grid, so there is no fixed neighbourhood for a filter to slide over. In short, GCNs are
the CNN analogue for data represented as a graph. An example of such a dataset (and the dataset I build a GCN for) is the
[Facebook Large Page-Page Network dataset](https://snap.stanford.edu/data/facebook-large-page-page-network.html). I classify the nodes of
this dataset: given the **128 features** of each webpage, I predict which of the four classes it belongs to:
* tvshow
* government
* company
* politician

Given that this problem is meant to be semi-supervised, the dataset needs to be split accordingly: the training and validation sets need to be
significantly smaller than the testing set.
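
As a minimal sketch of such a split (variable and file names are assumed to mirror model.py further down; 500 is the size actually used there, leaving the vast majority of nodes for testing):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# df_targets holds one class label per node, as in model.py
df_targets = pd.DataFrame(np.load("target.npy"), columns=["target"])

# 500 nodes for training, 500 for validation, everything else for testing
train_data, rest = train_test_split(df_targets, train_size=500)
val_data, test_data = train_test_split(rest, train_size=500)
```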


## How it works
1. The program imports the data, given as numpy arrays, and converts it to pandas dataframes (steps 1, 3 and 4 are sketched after this list)
2. The data is split into training, validation and testing sets; because the problem is semi-supervised, the training and validation sets each contain only 500 points
3. A graph-like data structure is created from the data
4. One-hot encoding converts the categorical target variable into a numerical one
5. The model is initialised with arbitrary hyperparameters at first and tuned later for better accuracy
6. The model is then trained and evaluated
7. The evaluation consists of a graph showing the validation and training accuracy and loss for each epoch
8. Using the testing data, the model tries to predict the page category
9. The learned embeddings are reduced to 2 dimensions and plotted using t-SNE
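
A condensed sketch of steps 1, 3 and 4 (the split from step 2 is sketched in the previous section; file and column names follow model.py below):

```python
import numpy as np
import pandas as pd
import stellargraph as sg
from sklearn import preprocessing as pre

# step 1: numpy arrays -> pandas dataframes
df_features = pd.DataFrame(np.load("features.npy"))
df_edges = pd.DataFrame(np.load("edges.npy"), columns=["source", "target"])
df_targets = pd.DataFrame(np.load("target.npy"), columns=["target"])

# step 3: build the graph structure; nodes carry the 128 features, edges link pages
graph = sg.StellarGraph(df_features, df_edges)

# step 4: one-hot encode the four page categories
encoder = pre.LabelBinarizer()
onehot_targets = encoder.fit_transform(df_targets["target"])
```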

## High level explanation of the algorithm
Each node has a set of features describing it and a set of neighbouring nodes. Each node sends a message to each of its neighbours containing all of its features.
The features received from the neighbours are combined using a simple linear operation (e.g. an average). The result is then passed through a standard neural
network layer, and the output becomes the new state of the node. This is done for every node in the graph.
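
As a toy illustration of one such propagation step (plain numpy with mean aggregation; this is a simplified sketch for intuition, not the StellarGraph layer used in model.py, and the matrix names are assumptions):

```python
import numpy as np

def gcn_step(A, H, W):
    """One propagation step: average the neighbours' features (including the
    node's own, via a self-loop), then apply a learned linear map and a ReLU."""
    A_hat = A + np.eye(A.shape[0])                      # add self-loops so a node keeps its own features
    A_norm = A_hat / A_hat.sum(axis=1, keepdims=True)   # row-normalise: mean over each neighbourhood
    return np.maximum(A_norm @ H @ W, 0)                # linear transform followed by ReLU
```

Here `A` would be the page-page adjacency matrix, `H` the 128-dimensional feature matrix and `W` a trainable weight matrix; stacking two such steps loosely corresponds to the two 32-unit GCN layers used in model.py.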

## Train and validation graphs
![train and validation accuracy](https://raw.githubusercontent.com/Pentaflouride/PatternFlow/topic-recognition/recognition/45249435/train_val%20accuracy.png)

## TSNE embedded graph
![TSNE](https://raw.githubusercontent.com/Pentaflouride/PatternFlow/topic-recognition/recognition/45249435/tsne.png)

## Training the model
![training of the model](https://raw.githubusercontent.com/Pentaflouride/PatternFlow/topic-recognition/recognition/45249435/Training_stage.png)

## Testing the model
![testing the model](https://raw.githubusercontent.com/Pentaflouride/PatternFlow/topic-recognition/recognition/45249435/testing%20model.png)

## Other outputs
Other outputs, like the shapes of the data and how I progressed in solving the problem, are given in the notebook. model.py is a refined version of the notebook wrapped in a
function. The driver.py file runs model.py and gives the outputs shown above (i.e. it does not show less important outputs such as data shapes).

## Usage
Run driver.py to get all the main outputs shown above. It does not require any arguments. The driver file has a main function, so the model runs as soon as
driver.py is executed.
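
For example, from the recognition/45249435 directory (model.py loads edges.npy, features.npy and target.npy by relative path, so they are assumed to be in the working directory):

```
python driver.py
```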

## Driver Dependencies
* Tensorflow
* Sklearn
* Keras
* Pandas
* Matplotlib
* Stellargraph

## Extra Notebook dependencies
* Scipy
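
One possible way to install these (PyPI package names; Sklearn installs as scikit-learn, and Keras ships with TensorFlow here since the code imports tensorflow.keras; exact versions are not pinned in this repository):

```
pip install tensorflow scikit-learn pandas matplotlib stellargraph scipy
```
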
Binary file added recognition/45249435/Training_stage.png
18 changes: 18 additions & 0 deletions recognition/45249435/driver.py
@@ -0,0 +1,18 @@
import model

def main():
""" running this file with start training the GCN model for the facebook dataset.
After training is done it will evaluate the validation accuracy.
After the validation accuracy is evaluated it will graph the validation
accuracy along with its loss and also the training accuracy and loss.
Lastly it will give a TSNE plot of the dataset and how it was evaluated given
specific colours. If there are more than 4 colours there was an error.
Also note that the TSNE plot might not match the colours given in the
README file since the models changes each time it is run and the colour
choices are random.
"""
model.run_model()


if __name__=="__main__":
main()
107 changes: 107 additions & 0 deletions recognition/45249435/model.py
@@ -0,0 +1,107 @@
# Model.py by Paul Turculetu (GCN algorithm)
# Feel free to use any of this code for any of your needs
# November 2021
# Final report
# Training a GCN on the Facebook dataset and producing a t-SNE plot

import pandas as pd
import numpy as np
import stellargraph as sg
from sklearn.model_selection import train_test_split
from sklearn import preprocessing as pre
from tensorflow.keras import layers, optimizers, losses, Model
from stellargraph.mapper import FullBatchNodeGenerator
from stellargraph.layer import GCN
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def run_model():
"""loads and preprocess data, then it trains, evaluates and graphs a TSNE
on the classification of nodes. It also graphs the error on the evaluation
and training dataset
"""

# load the numpy arrays of the data provided for the task
np_edges = np.load("edges.npy")
np_features = np.load("features.npy")
np_target = np.load("target.npy")

# store the data as dataframes and give the columns proper names
# so things don't become confusing, then build a graph from those edges and nodes
df_features = pd.DataFrame(np_features)
df_edges = pd.DataFrame(np_edges)
df_targets = pd.DataFrame(np_target)
df_edges.columns = ["source", "target"]
df_targets.columns = ["target"]
mat = sg.StellarGraph(df_features, df_edges)

# split the data into train, test and validation, keeping in mind that
# the train and validation sets need to be significantly smaller than
# the testing set.
train_data, test_data = train_test_split(df_targets, train_size=500)
val_data, test_data = train_test_split(test_data, train_size=500)

# one-hot encode the target datasets because right now each class is
# represented by a string
one_hot_target = pre.LabelBinarizer()
train_targets = one_hot_target.fit_transform(train_data['target'])
val_targets = one_hot_target.transform(val_data['target'])
test_targets = one_hot_target.transform(test_data['target'])

# initialize the model changing the hyper parameters to get
# better results
generator = FullBatchNodeGenerator(mat, method="gcn")
train_gen = generator.flow(train_data.index, train_targets)
gcn = GCN(
layer_sizes=[32, 32], activations=["relu", "relu"], generator=generator, dropout=0.2
)
x_in, x_out = gcn.in_out_tensors()
pred = layers.Dense(units=train_targets.shape[1], activation="softmax")(x_out)

# optimize the model using the adam optimizer
model = Model(inputs=x_in, outputs=pred)
model.compile(optimizer=optimizers.Adam(learning_rate=0.01),
loss=losses.categorical_crossentropy,
metrics=["acc"],
)
val_gen = generator.flow(val_data.index, val_targets)


# train the model
result = model.fit(
train_gen,
epochs=100,
validation_data=val_gen,
verbose=2,
shuffle=False
)

# show an accuracy graph
sg.utils.plot_history(result)

# Test the model on the testing data
test_gen = generator.flow(test_data.index, test_targets)
print("testing data accuracy given below: ")
model.evaluate(test_gen)

# set up the t-SNE by getting embeddings for the full dataset
all_nodes = df_targets.index
all_gen = generator.flow(all_nodes)

embedding_model = Model(inputs=x_in, outputs=x_out)
emb = embedding_model.predict(all_gen)
X = emb.squeeze(0)

# turn the data into 2 dimensions.
tsne = TSNE(n_components=2)
X_2 = tsne.fit_transform(X)

# draw the t-SNE plot, colouring each node by its class
fig, ax = plt.subplots(figsize=(10, 10))
ax.scatter(X_2[:, 0],X_2[:, 1],c=df_targets.squeeze(),cmap='turbo',
alpha=0.5)
ax.set(
title="TSNE visualization of GCN embeddings for facebook dataset"
)

# display all figures (needed when running as a plain script rather than in a notebook)
plt.show()

Binary file added recognition/45249435/testing model.png
Binary file added recognition/45249435/train_val accuracy.png
Binary file added recognition/45249435/tsne.png