TED videos with subtitles
Jul. 8th, 2013 09:21 am
I love watching TED talk videos. However, I am getting more and more deaf, so I am finding it harder to catch all of what is being said, especially during shots that don't show the speaker's mouth. The TED talk download pages have embedded transcripts that include timing information, so I set about working out how to extract them and create srt format subtitles from them.
In the hope of saving myself some work, I first looked around on the net to see if it had already been done. It has, but all the examples that I found are quite unsatisfactory, generally requiring a trip to another site to get the subtitle file. There is a Python script that nearly does what I want, but I couldn't get it to work, as it required some libraries that I didn't have and a search didn't show me how to add them to Python. (As an aside, my original enthusiasm for Python has begun to fade. It has become increasingly bloated, unnecessarily complex and unwieldy, and new variants keep breaking old programs.)
I wanted something that I could simply point at a TED talk page and have the program do the rest: it automatically downloads the video, renames it to something more informative, then creates the .srt subtitle file with the same name, so that playing the video will automatically display the subtitles.
It turned out to be easy to write, and uses only commands that are available on all Linux distributions. I use lots of comments so that I can come back months later and still work out how it operates, so hopefully you should be able to follow it. The comments make it look bigger than it really is. I often broke commands up over multiple lines for the sake of clarity, commenting each, so that sometimes 1 line becomes about 20. It really is a simple program. Let me know if you come up with some improvements.
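For anyone who has not looked inside a .srt file before, here is a small sketch of the layout the script is aiming for: each subtitle is a counter, a line with the start and end times, the text, and then a blank line. The write_cue function and the sample times and text are made up for illustration and are not part of dlted.

```shell
#!/bin/bash
# Hypothetical helper (not part of dlted): print one .srt subtitle entry.
# $1 = item count, $2 = start time, $3 = end time, $4 = subtitle text
write_cue () {
    # four lines per item: count, timing, text, blank line
    printf '%s\n%s --> %s\n%s\n\n' "$1" "$2" "$3" "$4"
}

# sample values only, not taken from a real talk
write_cue 1 "00:00:15,330" "00:00:18,000" "First line of the talk."
write_cue 2 "00:00:18,000" "00:00:21,500" "Second line of the talk."
```

Redirecting that output to a file with the same base name as the video (but ending in .srt) is what lets most players pick the subtitles up automatically.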
#!/bin/bash
# dlted
# by Miriam 2013-07-07
#
# given a TED talk download page, it
# gets the video
# renames it to the actual title
# downloads the transcript
# and converts it to .srt subtitle format
# finds the length of the intro and offsets the subtitles correctly
function show_options {
    echo -e "\e[34musage: \e[35mdlted <webpage address> [novid]\e[34m"
    echo -e " From webpage it downloads video, renames it to the title,"
    echo -e " (or skips downloading the video if 'novid' keyword is given)"
    echo -e " extracts transcript and creates .srt subtitle file.\e[0m"
}
if [ "$1" = "" -o "$1" = "-h" ]; then
    show_options
    exit
fi
# --------------------
# download the webpage
# --------------------
# check if web address (http:// or https://)
if [ "${1::7}" = "http://" -o "${1::8}" = "https://" ]; then
    wget "$1"
    name=`basename "$1"`
else
    echo -e "\e[31m ERROR...not a webpage address\e[0m"
    exit 1
fi
# --------------------------------
# download the video and rename it
# --------------------------------
# get the page title for a more informative name
# we'll use this name for the video and subtitle file
title=`sed -n '/altHeadline/p' "$name" | sed 's/.*>\([^<]*\)<.*/\1/; s/:/ -/'`
# get the download address for the video
video=`sed -n '/apikey=TEDDOWNLOAD/p' "$name" | sed 's/.*\(http.*mp4\).*/\1/'`
# examples make this next part clearer:
# $video="http://download.ted.com/talks/CamilleSeaman_2013.mp4"
# $vname="CamilleSeaman_2013.mp4"
# $vbase="CamilleSeaman_2013"
# $vext="mp4"
# $date="2013" (can't test for just digits as often has suffix letter)
vname="${video##*/}"
vbase="${vname%.*}"
vext="${vname##*.}"
date="${vbase##*_}"
# construct the new basename for the video and subtitle file
newname="${title}_${date}"
# give the option of skipping video download
# by following the address with keyword 'novid'
if [ "$2" != "novid" ]; then
    wget "$video"
    mv "$vname" "${newname}.${vext}"
fi
# ----------------------
# extract the transcript
# ----------------------
# get timing offset
intro=`sed -n '/introDuration%22%3A/p' "$name" | sed 's/.*introDuration%22%3A\([^%]*\)%.*/\1/'`
# normally I'd use bc for floating point calculations:
# offset=`echo "$intro*1000" | bc`
# but I've made precision default to 2 places after the decimal point
# and don't seem to be able to turn it off on-the-fly
# so I'll use dc (the reverse polish calculator) instead
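offset=`echo "$intro 1000 * p" | dc`
# dc can print a trailing fraction (e.g. 15330.00) and 'let' only does
# integer arithmetic, so drop anything after the decimal point
offset="${offset%.*}"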
# delete everything in the html file that is not transcript
# all subtitle lines contain "transcriptLink" (with the double quotes)
# all we have to do is save only those lines -- easy
sed -i -n '/"transcriptLink"/p' "$name"
# chop everything up-to and including the first occurrence of #
sed -i 's/^[^#]\+#//' "$name"
# put the timing number on its own line and strip the remainder of the html tag
sed -i 's/^\([0-9]\+\)"[^>]\+>/\1\n/' "$name"
# strip out the closing </a> tags
sed -i 's/<\/a>//' "$name"
# check if transcript is empty file
if [ ! -s "$name" ]; then
    echo -e "\e[31m No transcript -- can't make subtitles. Skipping.\e[0m"
    exit
fi
# ------------------------------------
# convert transcript to .srt subtitles
# ------------------------------------
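function convert_time () {
    # convert a time in milliseconds to the srt form HH:MM:SS,mmm
    local tt=$1
    # an hour is 3600000 milliseconds
    let h=tt/3600000
    let rh=tt%3600000
    let m=rh/60000
    let rm=rh%60000
    let s=rm/1000
    let rs=rm%1000
    # srt wants zero-padded fields: 2 digits each, 3 for the milliseconds
    printf '%02d:%02d:%02d,%03d\n' "$h" "$m" "$s" "$rs"
}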
subname="${newname}.srt"
count="0"
# if subtitle file already exists delete it
if [ -e "$subname" ] ; then rm "$subname" ; fi
# read each line into variables $time and $line
cat "$name" |
(
    while read -r time
    do
        read -r line
        # first time through we need to save time and line so we can get time difference
        if [ "$count" = "0" ]; then
            let time=$time+$offset
            time1="$time"
            line1="$line"
            count="1"
        else
            # offset to allow for intro
            let time=$time+$offset
            # convert from and to time values
            t1=`convert_time $time1`
            t2=`convert_time $time`
            # write 4 lines out to subtitle file
            # (item count, timing, previous line of text, blank line)
            echo -e "$count\n$t1 --> $t2\n$line1\n" >>"$subname"
            # save new values into old
            time1="$time"
            line1="$line"
            # increment line counter
            let count++
        fi
    done
    # finished the loop, but the last line still hasn't been written
    t1=`convert_time $time1`
    # we don't have another time value so make one by adding 2 seconds
    let time=time1+2000
    t2=`convert_time $time`
    # write the last item (4 lines: count, timing, text, blank line)
    echo -e "$count\n$t1 --> $t2\n$line1\n" >>"$subname"
)
# get rid of the intermediate transcript file
rm "$name"
echo -e "\e[32m...done.\e[0m"
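# I wrote a very simple, flexible, open-ended way to make sounds for any event.
# 'got' signals with an unobtrusive sound that the job is complete.
# (systemsound is my own command, so only call it if it is installed)
if command -v systemsound >/dev/null 2>&1; then
    systemsound got
fi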
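
A possible alternative to the dc step in the script: awk is also found on practically every Linux system, and it can turn the intro length in seconds into a whole number of milliseconds in one go, which is what the integer-only let arithmetic needs. This is only a sketch. The secs_to_ms name and the sample value 15.33 are assumptions, not something taken from a real TED page.

```shell
#!/bin/bash
# Hypothetical helper (not part of dlted): seconds -> whole milliseconds.
# $1 = a time in seconds, with or without a decimal part
secs_to_ms () {
    # multiply by 1000, add 0.5 so that %d's truncation rounds to nearest
    awk -v s="$1" 'BEGIN { printf "%d\n", s * 1000 + 0.5 }'
}

secs_to_ms 15.33    # sample value with a fraction
secs_to_ms 12       # sample value without one
```

In the script this would stand in for the dc line, along the lines of offset=`secs_to_ms "$intro"`.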