How to organize lots of related data?

tomwhiteman · December 19, 2022, 11:33am

Hey guys,

For a university project I’m trying to see how oil production/consumption and the crude oil price relate to certain oil stocks, and I’m a bit confused about how to organize this data.

I basically have 4 datasets:
Oil production
Oil consumption
Crude oil price
Historical price of certain oil company stock

If I am trying to find out how these 4 tables relate, what is the recommended way of organizing the data? Should I manually combine all this data into a single Excel sheet (seems like the most straightforward way), or is there a more efficient way to go about this?

I am brand new to PyTorch and data, so I apologise if this is a very basic question. Also, the data can basically get infinitely larger, by adding data from additional countries, other stock indexes, etc. So is there a way I can organize the data so it’s easy to add additional related data?

Finally, I have the month-to-month values for certain data (eg: oil production), and day-to-day values for other data (eg: oil price). What is the best way I can adjust the data to make up for this discrepancy?

Thanks in advance!

nivek · January 3, 2023, 9:05pm

Hi @tomwhiteman,

Thanks for posting here. These questions do not relate to PyTorch and you will likely find better responses elsewhere, such as on Stack Overflow.

I will provide a brief answer here as I have worked quite a bit on similar topics before:

What is the recommended way of organizing the data?

As long as you can easily read the data into Python (e.g. with pandas), it should be fine. Keep in mind that you are working with time series data, and the representation should reflect that. You can decide what to do with the data after reading it into Python.
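
To make that concrete, here is a minimal sketch of one way to do it, assuming each dataset is a CSV with a `date` column. The column names, the `load_series` helper, and all the numbers are made up for illustration (they are not real market data), and in practice you would pass file paths to `pd.read_csv` instead of the in-memory strings used here.

```python
import io

import pandas as pd

# Made-up placeholder values standing in for real CSV files (not real market data).
price_csv = io.StringIO(
    "date,crude_price\n"
    "2022-01-03,76.0\n"
    "2022-01-04,77.0\n"
    "2022-01-05,78.0\n"
)
stock_csv = io.StringIO(
    "date,stock_close\n"
    "2022-01-03,63.5\n"
    "2022-01-04,65.0\n"
    "2022-01-05,66.0\n"
)


def load_series(source):
    """Hypothetical helper: read one dataset and index it by date."""
    # Parsing the dates and using them as the index lets pandas
    # treat the table as a time series.
    df = pd.read_csv(source, parse_dates=["date"], index_col="date")
    return df.sort_index()


price = load_series(price_csv)
stock = load_series(stock_csv)

# Combine the datasets column-wise; rows are aligned on the date index.
# Adding another dataset later just means adding another entry to this list.
daily = pd.concat([price, stock], axis=1)
print(daily)
```

Keeping each dataset in its own file and combining them in code, rather than by hand in one Excel sheet, is what makes it easy to add more countries or stocks later.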

Finally, I have the month-to-month values for certain data (eg: oil production), and day-to-day values for other data (eg: oil price). What is the best way I can adjust the data to make up for this discrepancy?

Handling data of different frequencies is a common issue with time series data. You can either increase the frequency of the lower-frequency data (e.g. monthly to daily) or decrease the frequency of the higher-frequency data (e.g. daily to monthly) so that everything matches. The exact transformation (e.g. aggregating, averaging, picking first/last) should be chosen based on the economic context of the data.
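
As a rough sketch of both directions in pandas, assuming a daily price series and a monthly production series labeled by the first day of each month (all values below are made-up placeholders, and the variable names are only for illustration):

```python
import pandas as pd

# Made-up placeholder data: a daily price series covering two months ...
daily_idx = pd.date_range("2022-01-01", "2022-02-28", freq="D")
price = pd.Series(
    range(len(daily_idx)), index=daily_idx, name="crude_price", dtype="float64"
)

# ... and a monthly production series labeled by month start.
production = pd.Series(
    [100.0, 110.0],
    index=pd.to_datetime(["2022-01-01", "2022-02-01"]),
    name="production",
)

# Option 1: decrease the frequency of the daily data to monthly.
# Which aggregate to use (mean, last, first, ...) depends on the question asked.
monthly_price_mean = price.resample("MS").mean()
monthly_price_last = price.resample("MS").last()
monthly = pd.concat([production, monthly_price_mean], axis=1)

# Option 2: increase the frequency of the monthly data to daily by
# carrying each month's value forward until the next one appears.
daily_production = production.reindex(daily_idx, method="ffill")
daily = pd.concat([price, daily_production], axis=1)

print(monthly)
print(daily.head())
```

Forward-filling repeats the same monthly number on every day of that month, so it adds no new information. Whether that is acceptable depends on when the monthly figure actually became available relative to the daily prices.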