Text classification

Hi dear friends. I have a dataset for text classification. It is divided into two folders, train and test, and each of them is divided into two folders, negative and positive. Every positive and negative folder contains 1000 text files, each holding one sentence.
Next to these, I also have a text file containing 56050 words with their vectors. I think it is called a dictionary.
Now, I just need to classify the sentences. I have not worked on this topic at all. I uploaded my dataset to Google Drive and mounted it in Google Colab.

from google.colab import drive
drive.mount('/content/drive')
train_data = '/content/drive/MyDrive/data/train'
test_data = '/content/drive/MyDrive/data/test/'

print(f'Number of training examples: {len(train_data)}')
print(f'Number of testing examples: {len(test_data)}')

Now, when I run this code, it gives me 33 for the length. Why does this happen? What is wrong in my procedure? Please help me.
Thank you

Hi neda,

With len(train_data) you do not get the number of files in that directory but the number of characters in your string.
train_data is still only the pathname.
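You can check it directly:

print(len('/content/drive/MyDrive/data/train'))  # 33 -> the number of characters in the path string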

Thank you very much. Then, how can I get the number of sentences in my train_data?
According to what you said, I think I need code that reads the train data folder with its subfolders, but I don't know how to do that. I would appreciate it if you could help me, friends.
Thank you

import os

train_files = os.listdir(train_data)  # returns the list of entries in the path argument || should give you the folders ['positive', 'negative']
train_files_positive = os.listdir(os.path.join(train_data, train_files[0]))  # returns the list of text file names inside the first folder

Depending on the file format of your texts you need to open them accordingly.
For normal .txt files it will be something like:

# open the first text file of that folder and read the sentence it contains
with open(os.path.join(train_data, train_files[0], train_files_positive[0]), 'r') as f:
    sentence = f.read()
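And if it helps, here is a small sketch that reads every file in both class folders and collects the sentences together with their labels. I am assuming the subfolders are literally named 'positive' and 'negative' and that the files are plain UTF-8 text, so adjust the names or the encoding if yours differ:

import os

train_data = '/content/drive/MyDrive/data/train'  # paths from your first post
test_data = '/content/drive/MyDrive/data/test/'

def load_split(split_dir):
    # walk the two class folders and collect every sentence with its label
    sentences, labels = [], []
    for label in ['negative', 'positive']:  # assumed folder names
        class_dir = os.path.join(split_dir, label)
        for fname in sorted(os.listdir(class_dir)):
            with open(os.path.join(class_dir, fname), 'r', encoding='utf-8') as f:
                sentences.append(f.read().strip())
            labels.append(label)
    return sentences, labels

train_sentences, train_labels = load_split(train_data)
test_sentences, test_labels = load_split(test_data)
print(f'Number of training examples: {len(train_sentences)}')  # 1000 files per folder -> 2000
print(f'Number of testing examples: {len(test_sentences)}')

From there you can look each word up in your vector file and feed the result into whatever classifier you choose.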

Thank you so much. I will try it and give you the output.
Many thanks