[personal profile] miriam_e
I love watching TED talk videos. However, I am getting more and more deaf, so I am finding it harder to catch everything that is said, especially during shots that don't show the speaker's mouth. The TED talk download pages have embedded transcripts that include timing information, so I set about working out how to extract them and create srt format subtitles.

In the hope of saving myself some work, I first looked around on the net to see if it had already been done. It has, but all the examples I found are quite unsatisfactory, generally requiring a trip to another site to get the subtitle file. There is a Python script that nearly does what I want, but I couldn't get it to work because it required some libraries that I didn't have, and a search didn't show me how to add them to Python. (As an aside, my original enthusiasm for Python has begun to fade. It has become increasingly bloated, unnecessarily complex and unwieldy, and new variants keep breaking old programs.)

I wanted something that I could simply point at a TED talk page. The program would do the rest: it automatically downloads the video, renames it to something more informative, then creates the .srt subtitle file with the same name so that playing the video will automatically display the subtitles.
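
For anyone unfamiliar with the format, an .srt file is just plain text: a running number, a start and end time written as hours:minutes:seconds,milliseconds, the subtitle text, then a blank line. Here is a minimal sketch that writes a two-entry file. The video name and the subtitle text are made up for illustration and do not come from any real talk.

```shell
#!/bin/bash
# Illustrative only: the video name and subtitle text are invented.
video="Example Talk_2013.mp4"

# the subtitle file shares the video's base name,
# which is how many players find it automatically
subname="${video%.*}.srt"

# each entry: counter, "start --> end", text, blank line
printf '%s\n' \
   "1" \
   "00:00:12,000 --> 00:00:15,500" \
   "First line of the talk." \
   "" \
   "2" \
   "00:00:15,500 --> 00:00:19,250" \
   "Second line of the talk." \
   "" > "$subname"

cat "$subname"
```

Each entry ends where the next one begins, which is the same approach the script below takes with the transcript timings.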

It turned out to be easy to write, and uses only commands that are available on all Linux distributions. I use lots of comments so that I can come back months later and still work out how it operates, so hopefully you should be able to follow it. The comments make it look bigger than it really is. I often broke commands up over multiple lines for the sake of clarity, commenting each, so that sometimes 1 line becomes about 20. It really is a simple program. Let me know if you come up with some improvements.

#!/bin/bash

# dlted
# by Miriam  2013-07-07
#
# given a TED talk download page, it
#    gets the video
#    renames it to the actual title
#    downloads the transcript
#    and converts it to .srt subtitle format
#    finds the length of the intro and offsets the subtitles correctly


function show_options {
   echo -e "\e[34musage: \e[35mdlted <webpage-address> [novid]\e[34m"
   echo -e "  From webpage it downloads video, renames it to the title,"
   echo -e "  (or skips downloading the video if 'novid' keyword is given)"
   echo -e "  extracts transcript and creates .srt subtitle file.\e[0m"
}

if [ "$1" = "" -o "$1" = "-h" ]; then
   show_options
   exit
fi


# --------------------
# download the webpage
# --------------------
# check if web address
webcheck="${1::7}"
if [ "$webcheck" = "http://" ]; then
   wget "$1"
   name=`basename "$1"`
else
   echo -e "\e[31m ERROR...not a webpage address\e[0m"
   exit
fi


# --------------------------------
# download the video and rename it
# --------------------------------
# get the page title for a more informative name
# we'll use this name for the video and subtitle file
title=`sed -n '/altHeadline/p' "$name" | sed 's/.*>\([^<]*\)<.*/\1/; s/:/ -/'`

# get the download address for the video
video=`sed -n '/apikey=TEDDOWNLOAD/p' "$name" | sed 's/.*\(http.*mp4\).*/\1/'`

# examples make this next part clearer:
# $video="http://download.ted.com/talks/CamilleSeaman_2013.mp4"
# $vname="CamilleSeaman_2013.mp4"
# $vbase="CamilleSeaman_2013"
# $vext="mp4"
# $date="2013" (can't test for just digits as often has suffix letter)
vname="${video##*/}"
vbase="${vname%.*}"
vext="${vname##*.}"
date="${vbase##*_}"

# construct the new basename for the video and subtitle file
newname="${title}_${date}"

# give the option of skipping video download
# by following the address with keyword 'novid'
if [ ! "$2" = "novid" ]; then
   wget "$video"
   mv "$vname" "${newname}.${vext}"
fi

# ----------------------
# extract the transcript
# ----------------------
# get timing offset
intro=`sed -n '/introDuration%22%3A/p' "$name" | sed 's/.*introDuration%22%3A\([^%]*\)%.*/\1/'`
# normally I'd use bc for floating point calculations:
# offset=`echo "$intro*1000" | bc`
# but I've made precision default to 2 places after the decimal point
# and don't seem to be able to turn it off on-the-fly
# so I'll use dc (the reverse polish calculator) instead
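offset=`echo "$intro 1000 * p" | dc`
# drop any fractional part, since the shell arithmetic used later is integer-only
offset="${offset%.*}"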

# delete everything in the html file that is not transcript
# all subtitle lines contain "transcriptLink" (with the double quotes)
# all we have to do is save only those lines -- easy
sed -i -n '/"transcriptLink"/p'  "$name"

# chop everything up-to and including the first occurrence of #
sed -i 's/^[^#]\+#//'  "$name"

# put the timing number on its own line and strip the remainder of the html tag
sed -i 's/^\([0-9]\+\)"[^>]\+>/\1\n/'  "$name"

# strip out the </a> tags
sed -i 's/<\/a>//'  "$name"

# check if transcript is empty file
if [ ! -s "$name" ]; then
   echo -e "\e[31m No transcript -- can't make subtitles. Skipping.\e[0m"
   exit
fi

# ------------------------------------
# convert transcript to .srt subtitles
# ------------------------------------
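function convert_time () {
   # convert milliseconds to an srt timestamp (HH:MM:SS,mmm)
   local tt=$1
   # 3600000 milliseconds in an hour
   let h=tt/3600000
   let rh=tt%3600000
   let m=rh/60000
   let rm=rh%60000
   let s=rm/1000
   let rs=rm%1000
   # srt timestamps use zero-padded fields
   printf '%02d:%02d:%02d,%03d\n' "$h" "$m" "$s" "$rs"
}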

subname="${newname}.srt"
count="0"

# if subtitle file already exists delete it
if [ -e "$subname" ] ; then rm "$subname" ; fi

# read each line into variables $time and $line
cat "$name" |
(
   while read time
   do
      read line
      # first time through we need to save time and line so we can get time difference
      if [ "$count" = "0" ]; then
         let time=$time+$offset
         time1="$time"
         line1="$line"
         count="1"
      else
         # offset to allow for intro
         let time=$time+$offset

         # convert from and to time values
         t1=`convert_time $time1`
         t2=`convert_time $time`

         # write 4 lines out to subtitle file
         # (item count, timing, previous line of text, blank line)
         echo -e "$count\n$t1 --> $t2\n$line1\n" >>"$subname"

         # save new values into old
         time1="$time"
         line1="$line"

         # increment line counter
         let count++

      fi
   done

   # finished the loop, but the last line still hasn't been written
   t1=`convert_time $time1`
   # we don't have another time value so make one by adding 2 seconds
   let time=time1+2000
   t2=`convert_time $time`
   # write the last item (4 lines: count, timing, text, blank line)
   echo -e "$count\n$t1 --> $t2\n$line\n" >>"$subname"
)

# get rid of the intermediate transcript file
rm "$name"

echo -e "\e[32m...done.\e[0m"

# I wrote a very simple, flexible, open-ended way to make sounds for any events
# 'got' signals with an unobtrusive sound that the job is complete
# (systemsound is my own script, so delete the next line if you don't have it)
systemsound got
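
The arithmetic behind the timestamps can be checked on its own. Below is a minimal standalone sketch of the idea: split a count of milliseconds into hours, minutes, seconds and leftover milliseconds, then print them zero-padded in the HH:MM:SS,mmm form that srt files use. The function name ms_to_srt is made up for this example and is not part of dlted.

```shell
#!/bin/bash
# ms_to_srt is a hypothetical helper name used only in this sketch.
# It turns a count of milliseconds into an srt timestamp HH:MM:SS,mmm.
ms_to_srt () {
   local ms=$1
   local h=$(( ms / 3600000 ))           # 3600000 ms in an hour
   local m=$(( ms % 3600000 / 60000 ))   # 60000 ms in a minute
   local s=$(( ms % 60000 / 1000 ))      # 1000 ms in a second
   local f=$(( ms % 1000 ))              # leftover milliseconds
   printf '%02d:%02d:%02d,%03d\n' "$h" "$m" "$s" "$f"
}

ms_to_srt 3723004    # 1 hour, 2 minutes, 3 seconds, 4 milliseconds
ms_to_srt 15500      # 15.5 seconds
```

The two calls print 01:02:03,004 and 00:00:15,500. Trying a value over an hour is a quick way to confirm the hours divisor is right.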

