Welcome to this first individual project! On this occasion, I take on the role of a MLOps Engineer.
The client requested and delivered several datasets to us, with the objective that we process them and create a recommendation system.
Original rating datasets are saved in a drive in a drive on this link
Original platform datasets are saved in a drive in a drive on this link
In order to better structure the information in the data, we proceeded to:
For files of each platform (found in the 'datasets' folder): Amazon, Disney, Hulu, and Netflix.
-
A new id field was generated: Each new
idis composed of the first letter of the platform name, followed by the show_id already present in the datasets (example for Amazon titles =as123). -
Null values in the 'rating' field were replaced with the string “
G” ('general for all audiences'). -
The format of the 'date_added' column was changed to
YYYY-mm-dd. -
"All fields were converted to lowercase, no exceptions.
-
"Duration" field was divided into two:
duration_intandduration_type. The first field is anintand the second is astringindicating the duration unit of measurement: min (minutes) or season (seasons), which were all changed to singular form in order to facilitate search. -
A 'platform' column was added in order to facilitate the construction of the API and queries.
For rating files (from 1 to 8):
-
A new column was added to indicate which platform each ID in the 'movieId' column belongs to.
-
The 8 datasets were concatenated into a single one and exported, so that it could later be uploaded to GitHub after being loaded, since, due to the size of the datasets, it was not possible to deploy on Render.
-
The concatenated csv has been saved in a drive as dataset_ratings.csv
With the transformed datasets, they were made available to the client through an API built with FastAPI library, featuring various queries for the user:
-
Movie (not TV series, etc.) with the longest duration by year, platform, and duration type. Function name:
get_max_duration(year, platform, duration_type). Returns the name of the movie -
Number of movies (not TV series, etc.) by platform, with a score higher than XX in a given year. Function name:
get_score_count(platform, scored, year). Returns an 'int', with the total number of movies that meet the requested criteria. -
Number of movies (not TV series, etc.) by platform. Function name:
get_count_platform(platform). Returns an int, with the total number of movies for that platform. The platforms are named: Amazon, Netflix, Hulu, and Disney. -
Actor who appears most frequently by platform and year. Function name:
get_actor(platform, year). Returns the name of the actor who appears most frequently according to the given platform and year. -
The number of contents/products (all available streaming content) released by country and year. Function name:
prod_per_country(content_type, country, year). It returns the number of the specified content type (movie, series) per country and year in a dictionary with the variables 'country' (country name), 'year' (year), 'movie' (number of movies). -
Number of content/products (everything available for streaming, series, movies, etc) according to the given audience rating (for which audience the content was classified). Function name:
get_contents(rating). It returns the total number of contents with that audience rating.
The deployment was carried out through Render at the following link, with the project name: PI-MLOpsEngineer.
To obtain a first global overview of the datasets' structure, functions such as .shape, .dtype, .describe, .info, and .head were used. In order to observe a little more in detail, a box plot diagram was graphed, where several outliers were found in the "duration_int" column. Finally, to complement this analysis, the ProfileReport tool from pandas_profiling library was used, where it was possible to observe that there are no duplicate values. The platform "Hulu" contains the most null values, some columns such as "cast" do not contain any values. Additionally, it was observed that the movie with the lowest score was "filth" with a score of 0.5.
Due to Github does not deploy Pandas profiling reports, on these links you can find them for platform and ratings made with Profile Reports:
"Cosine Similarity" model was used for the recommendation system. The name of the function can be found on the same API interface as a seventh query, called get_recomendation(title), where you enter the title of a movie and it returns a list of 5 recommended movies based on that input title.
Due to the large size of the file generated after processing, in order to execute the deployment on Render, the number of records was reduced to 1000, which are saved in a .pickle file called similarity_matrix.pickle. Nevertheless, to show the function's operation, there is a list of 10 movies included in the program.


