amazing-ytdlp-archive

This project will be awesome, but only if I invest enough time. This software will replace my current cronjob-based yt-dlp archive service.

Partially inspired by hobune. While that project is amazing on its own, it's just not scalable.

The idea

The new setup will either run fully in Flask, including the task that checks the YouTube channels every x hours, or Flask will only be the GUI frontend and a separate script will do the channel archiving. I have not decided yet.

What currently works is that the GUI frontend talks to a separate database, while a cronjob handles downloading new videos from a list of channels.

How it currently works (legacy)

In the legacy folder you will find the files that currently make up my archive project. How it works is that a cronjob runs every 6 hours and invokes yt-dlp with a config file. In that config file, a channel list contains all the channels that yt-dlp needs to update. If a new video has been uploaded, yt-dlp automatically downloads a 720p version of the video, all subtitles available at that time (RIP community captions, I will not forget you) and a JSON file with the rest of the metadata. Oh, and also the thumbnail.
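For reference, a config along these lines would produce that behaviour. This is only a sketch, not the actual legacy config from this repo; in particular, channels.txt, downloaded.txt and the output path are placeholder names:

# yt-dlp.conf (sketch)
# channels.txt: one channel or playlist URL per line
--batch-file channels.txt
# remember what was already fetched so the cronjob only downloads new uploads
--download-archive downloaded.txt
# cap downloads at 720p
--format bestvideo[height<=720]+bestaudio/best[height<=720]
# also grab subtitles, the metadata JSON and the thumbnail
--write-subs
--write-auto-subs
--write-info-json
--write-thumbnail
--output ./videos/%(channel_id)s/%(upload_date)s/%(id)s/%(title)s.%(ext)s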

This works, but it is very slow and uses lots of "API" calls to YouTube, which will sometimes get the IP blocked. This needs to be overhauled.

Things learned

Video playlists

Using the following URL for a YouTube channel in yt-dlp references the uploads playlist for that particular channel. Not using the /videos path in this format will download some videos that were not uploaded by this channel.

https://www.youtube.com/channel/UCdBK94H6oZT2Q7l0-b0xmMg/videos

This is why my archive is tainted with channels referencing only one video.

However, by simply changing the second character from C to U, so that you get the following string, you can reference that channel's uploads playlist directly.

https://www.youtube.com/playlist?list=UUdBK94H6oZT2Q7l0-b0xmMg

Both links reference the same thing: the videos uploaded by this channel.
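As a concrete illustration, the swap is just a prefix change on the channel ID. This helper is hypothetical, not part of this repo:

def uploads_playlist_id(channel_id: str) -> str:
    # Channel IDs start with "UC"; the matching uploads playlist starts with "UU".
    if not channel_id.startswith("UC"):
        raise ValueError("expected a channel ID starting with UC")
    return "UU" + channel_id[2:]

# UCdBK94H6oZT2Q7l0-b0xmMg -> UUdBK94H6oZT2Q7l0-b0xmMg
print(uploads_playlist_id("UCdBK94H6oZT2Q7l0-b0xmMg"))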

Channel names

Creators change their names pretty often, but channel IDs are static. So if you create a folder structure by channel name, you will get multiple folders for the same channel. If you swap the channel name for the channel ID, the folders will never change.

Storage structure

The following folder structure is pretty nice for use with static scripts. The one drawback is that you can't search for video IDs or titles, because the search takes too long. This is mainly why we need a new system backed by a database.

./videos/{channel_id}/{upload_date}/{video_id}/video_title.mp4
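To see why searching is slow: finding a single video by ID means walking the whole tree, because the ID only shows up at the third directory level. A hypothetical lookup (not code from this repo):

from pathlib import Path

def find_video_dir(root: Path, video_id: str) -> Path | None:
    # Pattern: {channel_id}/{upload_date}/{video_id} - every channel and
    # upload date has to be scanned before the right folder is found.
    for candidate in root.glob(f"*/*/{video_id}"):
        if candidate.is_dir():
            return candidate
    return None

print(find_video_dir(Path("./videos"), "some_video_id"))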

For the new system, using blob-like storage will be key. I had the following layout in mind (see the sketch after it). The key will be an independent random ID, not the YouTube video ID, because I have noticed that multiple real videos can exist under the same ID when uploaders replace old videos.

-| data
 | - videos
   | - 128bit_random_id.mp4
 | - subtitles
   | - same_random_id_EN.srt
   | - same_random_id_DE.srt
 | - thumbnails
   | - 128bit_random_id.jpg
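A minimal sketch of how those keys and paths could be generated. The helper names are assumptions, and the mapping from blob ID to YouTube video ID would live in the database, which is left out here:

import secrets
from pathlib import Path

DATA_ROOT = Path("./data")

def new_blob_id() -> str:
    # 128 random bits, hex encoded; deliberately independent of the YouTube video ID
    return secrets.token_hex(16)

def blob_paths(blob_id: str, subtitle_langs: list[str]) -> dict[str, Path]:
    paths = {
        "video": DATA_ROOT / "videos" / f"{blob_id}.mp4",
        "thumbnail": DATA_ROOT / "thumbnails" / f"{blob_id}.jpg",
    }
    for lang in subtitle_langs:
        paths[f"subtitle_{lang}"] = DATA_ROOT / "subtitles" / f"{blob_id}_{lang}.srt"
    return paths

paths = blob_paths(new_blob_id(), ["EN", "DE"])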

API things learned

Etag is useful

When we call the API for 50 items in a playlist, we also get an etag back. This is a sort of hash of the returned data, so if we filter the returned data based on the etag we can eliminate most playlist calls. What I mean by that is: if we request the latest uploaded videos and the etag is the same as on the previous call, we can simply skip fetching the rest of the playlist, saving API calls.

But sometimes we could call it anyway, in case private videos have become public, since the etag only covers the last 50 videos. Saving the last etag in the database and sending it along with the API call will result in a 304 (Not Modified) response when nothing has changed for stale YouTube channels.
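A rough sketch of that flow against the playlistItems endpoint. The If-None-Match header and the 304 response are standard YouTube Data API v3 behaviour; the function itself and how the cached etag is stored are assumptions:

import requests

API_URL = "https://www.googleapis.com/youtube/v3/playlistItems"

def fetch_latest(playlist_id: str, api_key: str, cached_etag: str | None):
    headers = {}
    if cached_etag:
        # Ask the API to answer 304 Not Modified if nothing changed.
        headers["If-None-Match"] = cached_etag
    resp = requests.get(API_URL, params={
        "part": "snippet",
        "playlistId": playlist_id,
        "maxResults": 50,
        "key": api_key,
    }, headers=headers, timeout=30)

    if resp.status_code == 304:
        # Nothing new in the first 50 items; skip paging through the rest.
        return cached_etag, None

    data = resp.json()
    # Store data["etag"] for the next run and process data["items"].
    return data["etag"], data["items"]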

The etag is present in every response from the call and looks like this:

{
  "kind": "youtube#playlistItemListResponse",
  "etag": "pGXxpygXglKdyZa1rK8GZq0elbo",
  "nextPageToken": "EAAaBlBUOkNBVQ",
  "items": [
    {
    ...